1 Automatic parallelization: Automatic multithreading and multiprocessing of C
programs for IXP

Long Li, Bo Huang, Jinquan Dai, Luddy Harrison

June 2005 Proceedings of the tenth ACM SIGPLAN symposium on Principles and 
practice of parallel programming PPoPP '05 

Publisher: ACM Press 

Full text available: pdf(634.36 KB) Additional Information: full citation , abstract , references , index terms

Effective compilation of packet processing applications onto the Intel IXP network 
processors requires, among other things, the automatic use of multiple threads on one or 
more processing elements, and the automatic introduction of synchronization as required 
to correctly enforce dependences between such threads. We describe the program 
transformation that is used in the Intel Auto-partitioning C Compiler for IXP to
automatically multithread/multi-process a program for the IXP. This transformati ...

Keywords: code motion, critical section, multi-processing, multi-threading, network
processor 



2 Effective thread management on network processors with compiler analysis

Xiaotong Zhuang, Santosh Pande 

June 2006 ACM SIGPLAN Notices , Proceedings of the 2006 ACM SIGPLAN/SIGBED
conference on Language, compilers and tool support for embedded
systems LCTES '06, Volume 41 Issue 7
Publisher: ACM Press 

Full text available: pdf(466.13 KB) Additional Information: full citation , abstract , references , index terms 

Mapping packet processing tasks on network processor micro-engines involves complex 
tradeoffs relating to maximizing parallelism and pipelining. Due to an increase in the
size of the code store and complexity of the application requirements, network processors 
are being programmed with heterogeneous threads that may execute code belonging to 
different tasks on a given micro-engine. Also, most network applications are streaming 
applications that are typically processed in a pipelined fashion ... 

Keywords: CPU scheduling, compiler optimizations, network processors, real-time 
scheduling 



3 GPGPU: general purpose computation on graphics hardware 

David Luebke, Mark Harris, Jens Krüger, Tim Purcell, Naga Govindaraju, Ian Buck, Cliff
Woolley, Aaron Lefohn

August 2004 ACM SIGGRAPH 2004 Course Notes SIGGRAPH '04 
Publisher: ACM Press 

Full text available: pdf(63.03 MB) Additional Information: full citation , abstract , citings



The graphics processor (GPU) on today's commodity video cards has evolved into an 
extremely powerful and flexible processor. The latest graphics architectures provide 
tremendous memory bandwidth and computational horsepower, with fully programmable 
vertex and pixel processing units that support vector operations up to full IEEE floating 
point precision. High level languages have emerged for graphics hardware, making this 
computational power accessible. Architecturally, GPUs are highly parallel s ... 

4 Exploiting perception in high-fidelity virtual environments: Exploiting perception in
high-fidelity virtual environments

Additional presentations from the 24th course are available on the citation
page

Mashhuda Glencross, Alan G. Chalmers, Ming C. Lin, Miguel A. Otaduy, Diego Gutierrez
July 2006 ACM SIGGRAPH 2006 Courses SIGGRAPH '06 

Publisher: ACM Press 

Full text available: pdf(5.07 MB), mov(68:6 MIN) Additional Information: full citation , abstract , references

The objective of this course is to provide an introduction to the issues that must be 
considered when building high-fidelity 3D engaging shared virtual environments. The 
principles of human perception guide important development of algorithms and 
techniques in collaboration, graphical, auditory, and haptic rendering. We aim to show 
how human perception is exploited to achieve realism in high fidelity environments within 
the constraints of available finite computational resources. In this course w ... 

Keywords: collaborative environments, haptics, high-fidelity rendering, human-computer 
interaction, multi-user, networked applications, perception, virtual reality 



5 Real-time shading 

Marc Olano, Kurt Akeley, John C. Hart, Wolfgang Heidrich, Michael McCool, Jason L. Mitchell,
Randi Rost 

August 2004 ACM SIGGRAPH 2004 Course Notes SIGGRAPH '04 
Publisher: ACM Press 

Full text available: pdf(7.39 MB) Additional Information: full citation , abstract

Real-time procedural shading was once seen as a distant dream. When the first version of 
this course was offered four years ago, real-time shading was possible, but only with one- 
of-a-kind hardware or by combining the effects of tens to hundreds of rendering passes. 
Today, almost every new computer comes with graphics hardware capable of interactively 
executing shaders of thousands to tens of thousands of instructions. This course has been 
redesigned to address today's real-time shading capabill ... 

6 Posters: Latency hiding through multithreading on a network processor
Xiaofeng Guo, Jinquan Dai, Long Li, Zhiyuan Lv, Prashant R. Chandra
March 2007 Proceedings of the 12th ACM SIGPLAN symposium on Principles and
practice of parallel programming PPoPP '07
Publisher: ACM Press

Full text available: pdf(321.71 KB) Additional Information: full citation , index terms



Keywords: code motion, compiler, latency hiding, multicore 



7 How to solve the current memory access and data transfer bottlenecks: at the
processor architecture or at the compiler level

Francky Catthoor, Nikil D. Dutt, Christoforos E. Kozyrakis

January 2000 Proceedings of the conference on Design, automation and test in
Europe DATE '00

Publisher: ACM Press

Full text available: pdf(92.68 KB) Additional Information: full citation , references , citings , index terms

Publisher Site



8 A dynamic optimization framework for a Java just-in-time compiler

Toshio Suganuma, Toshiaki Yasue, Motohiro Kawahito, Hideaki Komatsu, Toshio Nakatani

October 2001 ACM SIGPLAN Notices , Proceedings of the 16th ACM SIGPLAN
conference on Object oriented programming, systems, languages, and
applications OOPSLA '01, Volume 36 Issue 11

Publisher: ACM Press

Full text available: pdf(2.12 MB) Additional Information: full citation , abstract , references , citings , index terms

The high performance implementation of Java Virtual Machines (JVM) and just-in-time 
(JIT) compilers is directed toward adaptive compilation optimizations on the basis of 
online runtime profile information. This paper describes the design and implementation of 
a dynamic optimization framework in a production-level Java JIT compiler. Our approach
is to employ a mixed mode interpreter and a three level optimizing compiler, supporting 
quick, full, and special optimization, each of which has a differ ... 

9 Balancing register allocation across threads for a multithreaded network processor
Xiaotong Zhuang, Santosh Pande

June 2004 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 2004 conference
on Programming language design and implementation PLDI '04, Volume 39
Issue 6
Publisher: ACM Press

Full text available: pdf(429.85 KB) Additional Information: full citation , abstract , references , citings , index terms

Modern network processors employ multi-threading to allow concurrency amongst 
multiple packet processing tasks. We studied the properties of applications running on the 
network processors and observed that their imbalanced register requirements across 
different threads at different program points could lead to poor performance. Many times 
application needs demand some threads to be more performance critical than others and 
thus by controlling the register allocation across threads one could impa ... 

Keywords: multithreaded processor, network processor, register allocation 



10 Multithreading I: Master/slave speculative parallelization 
Craig Zilles, Gurindar Sohi 

November 2002 Proceedings of the 35th annual ACM/IEEE international symposium
on Microarchitecture MICRO 35
Publisher: IEEE Computer Society Press

Full text available: pdf Additional Information: full citation , abstract , references , citings , index terms
Publisher Site

Master/Slave Speculative Parallelization (MSSP) is an execution paradigm for improving 
the execution rate of sequential programs by parallelizing them speculatively for 
execution on a multiprocessor. In MSSP, one processor— the master— executes an 
approximate version of the program to compute selected values that the full program's 
execution is expected to compute. The master's results are checked by slave processors 
that execute the original program. This validation is parallelized by cutting ... 

11 Compilers: A framework for reducing instruction scheduling overhead in dynamic
compilers

Vikki Tang, Joran Siu, Alexander Vasilevskiy, Marcel Mitran

October 2006 Proceedings of the 2006 conference of the Center for Advanced Studies
on Collaborative research CASCON '06
Publisher: ACM Press

Full text available: pdf Additional Information: full citation , abstract , references

Start-up time is a serious concern for high-availability applications such as web servers, 
transaction managers, and batch processes. Compilation time contributes directly to
start-up costs in dynamic compilers. Up to 30% of compilation time is spent scheduling
instructions in the IBM® Testarossa just-in-time compiler. In this paper, we describe a
scheduling framework that reduces scheduling overhead by up to 61% with little to no
degradation in throughput performance. By combining online pr ...



12 Evaluating titanium SPMD programs on the Tera MTA
Carleton Miyamoto, Chang Lin

January 1999 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
(CDROM) Supercomputing '99
Publisher: ACM Press

Full text available: pdf(314.08 KB) Additional Information: full citation , references , index terms



13 Array SSA form and its use in parallelization
Kathleen Knobe, Vivek Sarkar

January 1998 Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on
Principles of programming languages POPL '98
Publisher: ACM Press

Full text available: pdf(1.99 MB) Additional Information: full citation , references , citings , index terms



14 Making Sequential Consistency Practical in Titanium
Amir Kamil, Jimmy Su, Katherine Yelick

November 2005 Proceedings of the 2005 ACM/IEEE conference on Supercomputing SC '05

Publisher: IEEE Computer Society

Full text available: pdf(382.47 KB) Additional Information: full citation , abstract , index terms

The memory consistency model in shared memory parallel programming controls the 
order in which memory operations performed by one thread may be observed by another. 
The most natural model for programmers is to have memory accesses appear to take 
effect in the order specified in the original program. Language designers have been 
reluctant to use this strong semantics, called sequential consistency, due to concerns over 
the performance of memory fence instructions and related mechanisms that guara ...

15 Object and native code thread mobility among heterogeneous computers (includes
sources)

B. Steensgaard, E. Jul

December 1995 ACM SIGOPS Operating Systems Review , Proceedings of the fifteenth
ACM symposium on Operating systems principles SOSP '95, Volume 29
Issue 5

Publisher: ACM Press

Full text available: pdf(1.50 MB) Additional Information: full citation , references , citings , index terms



16 An Application-Based Performance Characterization of the Columbia Supercluster
Rupak Biswas, M. Jahed Djomehri, Robert Hood, Haoqiang Jin, Cetin Kiris, Subhash Saini
November 2005 Proceedings of the 2005 ACM/IEEE conference on Supercomputing SC '05

Publisher: IEEE Computer Society

Full text available: pdf(776.50 KB) Additional Information: full citation , abstract , index terms

Columbia is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 
processors each, and currently ranked as one of the fastest computers in the world. In 
this paper, we present the performance characteristics of Columbia obtained on up to four 
computing nodes interconnected via the InfiniBand and/or NUMAlink4 communication
fabrics. We evaluate floating-point performance, memory bandwidth, message passing
communication speeds, and compilers using a subset of the HPC Challenge benchm ...

Keywords: SGI Altix, multi-level parallelism, HPC Challenge benchmarks, NAS Parallel
Benchmarks, molecular dynamics, multi-block overset grids, computational fluid dynamics 



17 Shangri-La: achieving high performance from compiled network applications while
enabling ease of programming

Michael K. Chen, Xiao Feng Li, Ruiqi Lian, Jason H. Lin, Lixia Liu, Tao Liu, Roy Ju

June 2005 ACM SIGPLAN Notices , Proceedings of the 2005 ACM SIGPLAN conference
on Programming language design and implementation PLDI '05, Volume 40
Issue 6

Publisher: ACM Press

Full text available: pdf(480.93 KB) Additional Information: full citation , abstract , references , citings , index terms

Programming network processors is challenging. To sustain high line rates, network 
processors have extremely tight memory access and instruction budgets. Achieving 
desired performance has traditionally required hand-coded assembly. Researchers have 
recently proposed high-level programming languages for packet processing, but the 
challenges of compiling these languages into code that is competitive with hand-tuned 
assembly remain unanswered. This paper describes the Shangri-La compiler, which
acce ... 

Keywords: chip multiprocessors, dataflow programming, network processors, packet 
processing, program partitioning, throughput-oriented computing 

18 Session S6.2: compilers and program analysis: Experience with a retargetable
compiler for a commercial network processor

Jinhwan Kim, Sungjoon Jung, Yunheung Paek, Gang-Ryung Uh

October 2002 Proceedings of the 2002 international conference on Compilers,
architecture, and synthesis for embedded systems CASES '02
Publisher: ACM Press

Full text available: pdf(275.87 KB) Additional Information: full citation , abstract , references , citings , index terms

The Paion PPII network processor is designed to meet the growing need for new high
bandwidth network equipment. In order to rapidly reconfigure the processor for frequently 
varying internet services and technologies, a high performance compiler is urgently 
needed. Albeit various code generation techniques have been proposed for DSPs or ASIPs, 
we experienced these techniques are not easily tailored towards the target Paion PPII 
processor due to striking architectural differences. First, we will s ... 

Keywords: compiler, network processor, non-orthogonal architecture 



19 The interactive performance of SLIM: a stateless, thin-client architecture
Brian K. Schmidt, Monica S. Lam, J. Duane Northcutt

December 1999 ACM SIGOPS Operating Systems Review , Proceedings of the
seventeenth ACM symposium on Operating systems principles SOSP
'99, Volume 33 Issue 5

Publisher: ACM Press

Full text available: pdf(1.79 MB) Additional Information: full citation , abstract , references , citings , index terms

Taking the concept of thin clients to the limit, this paper proposes that desktop machines 
should just be simple, stateless I/O devices (display, keyboard, mouse, etc.) that access a 
shared pool of computational resources over a dedicated interconnection fabric — much 
in the same way as a building's telephone services are accessed by a collection of handset 
devices. The stateless desktop design provides a useful mobility model in which users can
transparently resume their work on any desktop c ... 

20 Compiler-directed channel allocation for saving power in on-chip networks
Guangyu Chen, Feihui Li, Mahmut Kandemir

January 2006 ACM SIGPLAN Notices , Conference record of the 33rd ACM SIGPLAN-
SIGACT symposium on Principles of programming languages POPL '06,
Volume 41 Issue 1
Publisher: ACM Press

Full text available: pdf(943.11 KB) Additional Information: full citation , abstract , references , citings , index terms

Increasing complexity in the communication patterns of embedded applications
parallelized over multiple processing units makes it difficult to continue using the
traditional bus-based on-chip communication techniques. The main contribution of this
paper is to demonstrate the importance of compiler technology in reducing power
consumption of applications designed for emerging multi-processor, NoC
(Network-on-Chip) based embedded systems. Specifically, we propose and evaluate a
compiler-directed a ...

Keywords: NoC, compiler, energy consumption 

21 System-level power optimization: techniques and tools

Luca Benini, Giovanni de Micheli
April 2000 ACM Transactions on Design Automation of Electronic Systems (TODAES),
Volume 5 Issue 2

Publisher: ACM Press

Full text available: pdf(385.22 KB) Additional Information: full citation , abstract , references , citings , index terms



This tutorial surveys design methods for energy-efficient system-level design. We
consider electronic systems consisting of a hardware platform and software layers. We
consider the three major constituents of hardware that consume energy, namely
computation, communication, and storage units, and we review methods of reducing their
energy consumption. We also study models for analyzing the energy cost of software, and
methods for energy-efficient software design and compilation. This survey ...

22 Compiler transformations for high-performance computing

David F. Bacon, Susan L. Graham, Oliver J. Sharp
December 1994 ACM Computing Surveys (CSUR), Volume 26 Issue 4

Publisher: ACM Press

Full text available: pdf(6.32 MB) Additional Information: full citation , abstract , references , citings , index terms , review



In the last three decades a large number of compiler transformations for optimizing 
programs have been implemented. Most optimizations for uniprocessors reduce the
number of instructions executed by the program using transformations based on the 
analysis of scalar quantities and data-flow techniques. In contrast, optimizations for high- 
performance superscalar, vector, and parallel processors maximize parallelism and 
memory locality with transformations that rely on tracking the properties o ... 

Keywords: compilation, dependence analysis, locality, multiprocessors, optimization, 
parallelism, superscalar processors, vectorization 

24 Source-level global optimizations for fine-grain distributed shared memory systems
R. Veldema, R. F. H. Hofman, R. A. F. Bhoedjang, C. J. H. Jacobs, H. E. Bal

June 2001 ACM SIGPLAN Notices , Proceedings of the eighth ACM SIGPLAN
symposium on Principles and practices of parallel programming PPoPP
'01, Volume 36 Issue 7
Publisher: ACM Press

Full text available: pdf(112.60 KB) Additional Information: full citation , abstract , references , citings , index terms

This paper describes and evaluates the use of aggressive static analysis in Jackal, a
fine-grain Distributed Shared Memory (DSM) system for Java. Jackal uses an optimizing,
source-level compiler rather than the binary rewriting techniques employed by most other 
fine-grain DSM systems. Source-level analysis makes existing access-check optimizations 
(e.g., access-check batching) more effective and enables two novel fine-grain DSM 
optimizations: object-graph aggregatio ... 

25 Reducing NoC energy consumption through compiler-directed channel voltage
scaling

Guangyu Chen, Feihui Li, Mahmut Kandemir, Mary Jane Irwin

June 2006 ACM SIGPLAN Notices , Proceedings of the 2006 ACM SIGPLAN conference
on Programming language design and implementation PLDI '06, Volume 41
Issue 6

Publisher: ACM Press

Full text available: pdf(536.63 KB) Additional Information: full citation , abstract , references , citings , index terms

While scalable NoC (Network-on-Chip) based communication architectures have clear 
advantages over long point-to-point communication channels, their power consumption 
can be very high. In contrast to most of the existing hardware-based efforts on NoC
power optimization, this paper proposes a compiler-directed approach where the compiler 
decides the appropriate voltage/frequency levels to be used for each communication 
channel in the NoC. Our approach builds and operates on a novel graph based rep ... 

Keywords: compiler, energy, network-on-chip



26 Symbolic bounds analysis of pointers, array indices, and accessed memory regions 

Radu Rugina, Martin C. Rinard
March 2005 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 27 Issue 2

Publisher: ACM Press

Full text available: pdf(490.56 KB) Additional Information: full citation , abstract , references , index terms

This article presents a novel framework for the symbolic bounds analysis of pointers,
array indices, and accessed memory regions. Our framework formulates each analysis
problem as a system of inequality constraints between symbolic bound polynomials. It 
then reduces the constraint system to a linear program. The solution to the linear 
program provides symbolic lower and upper bounds for the values of pointer and array 
index variables and for the regions of memory that each statement and procedur ... 

Keywords: Symbolic analysis, parallelization, static race detection 



27 Pthreads for dynamic and irregular parallelism 
Girija J. Narlikar, Guy E. Blelloch 

November 1998 Proceedings of the 1998 ACM/IEEE conference on Supercomputing 
(CDROM) Supercomputing '98 

Publisher: IEEE Computer Society 

Full text available: html(82.60 KB) Additional Information: full citation , abstract , references , citings

High performance applications on shared memory machines have typically been written in 
a coarse grained style, with one heavyweight thread per processor. In comparison, 
programming with a large number of lightweight, parallel threads has several advantages, 
including simpler coding for programs with irregular and dynamic parallelism, and better 
adaptability to a changing number of processors. The programmer can express a new 
thread to execute each individual parallel task; the implementation dyn ... 



Keywords: Pthreads, dynamic scheduling, irregular parallelism, lightweight threads,
multithreading, space efficiency



28 Coordinated parallelizing compiler optimizations and high-level synthesis

Sumit Gupta, Rajesh Kumar Gupta, Nikil D. Dutt, Alexandru Nicolau
October 2004 ACM Transactions on Design Automation of Electronic Systems
(TODAES), Volume 9 Issue 4
Publisher: ACM Press

Full text available: pdf(923.65 KB) Additional Information: full citation , abstract , references , index terms

We present a high-level synthesis methodology that applies a coordinated set of coarse- 
grain and fine-grain parallelizing transformations. The transformations are applied both 
during a pre-synthesis phase and during scheduling, with the objective of optimizing the 
results of synthesis and reducing the impact of control flow constructs on the quality of 
results. We first apply a set of source level presynthesis transformations that include 
common sub-expression elimination (CSE), copy propagat ... 

Keywords: Code motions, common subexpression elimination, dynamic CSE, embedded 
systems, high-level synthesis, parallelizing transformations, presynthesis 



29 Critical path reduction for scalar programs 
Michael Schlansker, Vinod Kathail 

December 1995 Proceedings of the 28th annual international symposium on
Microarchitecture MICRO 28
Publisher: IEEE Computer Society Press

Full text available: pdf(1.38 MB) Additional Information: full citation , references , citings , index terms



30 Avoidance and suppression of compensation code in a trace scheduling compiler 
Stefan M. Freudenberger, Thomas R. Gross, P. Geoffrey Lowney 

July 1994 ACM Transactions on Programming Languages and Systems (TOPLAS), 

Volume 16 Issue 4 

Publisher: ACM Press 

Full text available: pdf(3.58 MB) Additional Information: full citation , abstract , references , citings , index terms , review

Trace scheduling is an optimization technique that selects a sequence of basic blocks as a 
trace and schedules the operations from the trace together. If an operation is moved 
across basic block boundaries, one or more compensation copies may be required in the 
off-trace code. This article discusses the generation of compensation code in a trace 
scheduling compiler and presents techniques for limiting the amount of compensation 
code: avoidance (restricting code motion so that no compensatio ... 

Keywords: SPEC89, instruction-level parallelism, performance evaluation, trace 
scheduling 



31 Summary of ACM/ONR workshop on parallel and distributed debugging 
Publisher: ACM Press

32 Mapping esterel onto a multi-threaded embedded processor
Xin Li, Marian Boldt, Reinhard von Hanxleden

October 2006 ACM SIGARCH Computer Architecture News , ACM SIGOPS Operating
Systems Review , ACM SIGPLAN Notices , Proceedings of the 12th
international conference on Architectural support for programming
languages and operating systems ASPLOS-XII, Volume 34 , 40 , 41 Issue 5, 5, 11

Publisher: ACM Press

Full text available: pdf(490.15 KB) Additional Information: full citation , abstract , references , index terms



The synchronous language Esterel is well-suited for programming control-dominated 
reactive systems at the system level. It provides non-traditional control structures, in 
particular concurrency and various forms of preemption, which allow to concisely express 
reactive behavior. As these control structures cannot be mapped easily onto traditional,
sequential processors, an alternative approach that has emerged recently makes use of
special-purpose reactive processors. However, the designs propose ...

Keywords: concurrency, esterel, low-power processing, multi-threading, reactive
systems 



33 Power reduction techniques for microprocessor systems 
Vasanth Venkatachalam, Michael Franz

September 2005 ACM Computing Surveys (CSUR), Volume 37 Issue 3
Publisher: ACM Press

Full text available: pdf(602.33 KB) Additional Information: full citation , abstract , references , index terms

Power consumption is a major factor that limits the performance of computers. We survey 
the "state of the art" in techniques that reduce the total power consumed by a 
microprocessor system over time. These techniques are applied at various levels ranging 
from circuits to architectures, architectures to system software, and system software to 
applications. They also include holistic approaches that will become more important over 
the next decade. We conclude that power management is a ... 




Keywords: Energy dissipation, power reduction



34 Parallelizing load/stores on dual-bank memory embedded processors 

Xiaotong Zhuang, Santosh Pande 
August 2006 ACM Transactions on Embedded Computing Systems (TECS), Volume 5 Issue 3

Publisher: ACM Press

Full text available: pdf(746.64 KB) Additional Information: full citation , abstract , references , index terms

Many modern embedded processors such as DSPs support partitioned memory banks 
(also called X-Y memory or dual-bank memory) along with parallel load/store
instructions to achieve higher code density and performance. In order to effectively utilize
the parallel load/store instructions, the compiler must partition the memory-resident
values and assign them to X or Y bank. This paper gives a post-register allocation solution
to merge the generated load/store instructions into their parallel counterp ...

Keywords: DSP architectures, memory bank allocation, parallel load/stores, profile 
driven optimization 

September 2001 Journal on Educational Resources in Computing (JERIC)

36 Data and memory optimization techniques for embedded systems

P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A.
Vandercappelle, P. G. Kjeldsberg

April 2001 ACM Transactions on Design Automation of Electronic Systems (TODAES),
Volume 6 Issue 2

Publisher: ACM Press

Full text available: pdf(339.91 KB) Additional Information: full citation , abstract , references , citings , index terms

We present a survey of the state-of-the-art techniques used in performing data and
memory-related optimizations in embedded systems. The optimizations are targeted
directly or indirectly at the memory subsystem, and impact one or more out of three
important cost metrics: area, performance, and power dissipation of the resulting
implementation. We first examine architecture-independent optimizations in the form of
code transformations. We next cover a broad spectrum of optimizati ...

Keywords: DRAM, SRAM, address generation, allocation, architecture exploration, code 
transformation, data cache, data optimization, high-level synthesis, memory architecture 
customization, memory power dissipation, register file, size estimation, survey 



37 "Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order
Offense

Ronald D. Barnes, Shane Ryoo, Wen-mei W. Hwu

November 2005 Proceedings of the 38th annual IEEE/ACM International Symposium
on Microarchitecture MICRO 38

Publisher: IEEE Computer Society

Full text available: pdf(346.82 KB) Additional Information: full citation , abstract , index terms

Publisher Site

As microprocessor designs become increasingly power- and complexity-conscious, future
microarchitectures must decrease their reliance on expensive dynamic scheduling
structures. While compilers have generally proven adept at planning useful static
instruction-level parallelism, relying solely on the compiler's instruction execution
arrangement performs poorly when cache misses occur, because variable latency is not 
well tolerated. This paper proposes a new microarchitectural model, multipass pipel ... 

38 Using high performance GIS software to visualize data: a hands-on software 
demonstration 

Linda Burton, William Hatchett, Mari Hobkirk, Charles Powell 

November 1998 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
(CDROM) Supercomputing '98
Publisher: IEEE Computer Society

Full text available: html(80.49 KB) Additional Information: full citation , abstract , references

Since 1995 Wheat Ridge High School (WRHS) students have participated in a mapping 
project involving local open space, in conjunction with NASA. Students have learned to 
use Idrisi, a Geographical Imaging Systems (GIS) software, as well as other GIS 
programs Arc View and Multispec, to plan the location of a trail along Colorado's front 
range. As this project has progressed, students have learned the GIS technology as well as
many science issues related to trail mapping. Simila ... 

39 Compiler optimizations for power, performance: Architectural analysis and
instruction-set optimization for design of network protocol processors

Haiyong Xie, Li Zhao, Laxmi Bhuyan

October 2003 Proceedings of the 1st IEEE/ACM/IFIP international conference on
Hardware/software codesign and system synthesis CODES+ISSS '03
Publisher: ACM Press

Full text available: pdf(87.43 KB) Additional Information: full citation , abstract , references , citings , index terms

TCP/IP protocol processing latency has been an important issue in high-speed networks.
In this paper, we present an architectural study of TCP/IP protocol. We port the TCP/IP 
protocol stack from the 4.4 FreeBSD to the SimpleScalar simulation environment. The 
architectural characteristics, such as instruction level parallelism and cache behavior, are 
studied through simulation. We also compare the characteristics of TCP/IP protocol to that 
of SPECint benchmark programs. It turns out that the form ... 

Keywords: TCP/IP protocol, architecture simulation, instruction optimization, network 
processing 
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Compiling Fortran 8x array features for the connection machine computer system 



Eugene Albert, Kathleen Knobe, Joan D. Lukas, Guy L. Steele 

January 1988 ACM SIGPLAN Notices , Proceedings of the ACM/SIGPLAN conference
on Parallel programming: experience with applications, languages and
systems PPEALS '88, Volume 23 Issue 9

Publisher: ACM Press

Full text available: pdf(1.72 MB) Additional Information: full citation , abstract , references , citings , index
terms

The Connection Machine® computer system supports a data parallel programming style, 
making it a natural target architecture for Fortran 8x array constructs. The Connection 
Machine Fortran compiler generates VAX code that performs scalar operations and directs 
the Connection Machine to perform array operations. The Connection Machine virtual 
processor mechanism supports elemental operations on very large arrays. Most array 
operators and intrinsic functions map into single instructions or ... 



Results 21 - 40 of 200


Terms used 

compiler optimization multi threading code motion critical section network processor 



Found 60,664 of 
199,787 




Results 41 - 60 of 200 
Best 200 shown 






41 Optimizing parallel programs with explicit synchronization
Arvind Krishnamurthy, Katherine Yelick
June 1995 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 1995 conference
on Programming language design and implementation PLDI '95, Volume 30 Issue 6

Publisher: ACM Press

Additional Information: full citation , abstract , references , citings , index
terms



Full text available: MB) 



We present compiler analyses and optimizations for explicitly parallel programs that 
communicate through a shared address space. Any type of code motion on explicitly 
parallel programs requires a new kind of analysis to ensure that operations reordered on 
one processor cannot be observed by another. The analysis, based on work by Shasha 
and Snir, checks for cycles among interfering accesses. We improve the accuracy of their 
analysis by using additional information from post-wait synchroniza ... 



42 Special Section: Link-time compaction and optimization of ARM executables

Bjorn De Sutter, Ludo Van Put, Dominique Chanet, Bruno De Bus, Koen De Bosschere
February 2007 ACM Transactions on Embedded Computing Systems (TECS), Volume 6 Issue 1

Publisher: ACM Press

Full text available: pdf(636.53 KB) Additional Information: full citation , abstract , references , index terms

The overhead in terms of code size, power consumption, and execution time caused by 
the use of precompiled libraries and separate compilation is often unacceptable in the 
embedded world, where real-time constraints, battery life-time, and production costs are 
of critical importance. In this paper, we present our link-time optimizer for the ARM 
architecture. We discuss how we can deal with the peculiarities of the ARM architecture 
related to its visible program counter and how the introduced over ... 



Keywords: Performance, compaction, linker, optimization 



43 A lifetime optimal algorithm for speculative PRE
Jingling Xue, Qiong Cai

June 2006 ACM Transactions on Architecture and Code Optimization (TACO), Volume 3 Issue 2

Publisher: ACM Press

Full text available: pdf(1.38 MB) Additional Information: full citation , abstract , references , index terms

A lifetime optimal algorithm, called MC-PRE, is presented for the first time that performs 
speculative PRE based on edge profiles. In addition to being computationally optimal in 
the sense that the total number of dynamic computations for an expression in the 
transformed code is minimized, MC-PRE is also lifetime optimal since the lifetimes of 
introduced temporaries are also minimized. The key in achieving lifetime optimality lies 
not only in finding a unique minimum cut on a transformed graph o ... 



Keywords: Partial redundancy elimination, classic PRE, computational optimality, data- 
flow analysis, lifetime optimality, speculative PRE 



44 The program decision logic approach to predicated execution
David I. August, John W. Sias, Jean-Michel Puiatti, Scott A. Mahlke, Daniel A. Connors, Kevin
M. Crozier, Wen-mei W. Hwu

May 1999 ACM SIGARCH Computer Architecture News , Proceedings of the 26th
annual international symposium on Computer architecture ISCA '99, Volume 27 Issue 2

Publisher: IEEE Computer Society, ACM Press

Full text available: pdf(215.39 KB) Additional Information: full citation , abstract , references , citings , index
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Modern compilers must expose sufficient amounts of Instruction-Level Parallelism (ILP) to
achieve the promised performance increases of superscalar and VLIW processors. One of 
the major impediments to achieving this goal has been inefficient programmatic control 
flow. Historically, the compiler has translated the programmer's original control structure 
directly into assembly code with conditional branch instructions. Eliminating inefficiencies 
in handling branch instructions and exploiting ILP h ... 

45 Architectures and performance analysis: New approach to architectural synthesis:
incorporating QoS constraint

Harsh Dhand, Basant Dwivedi, M. Balakrishnan

October 2006 Proceedings of the 6th ACM & IEEE International conference on
Embedded software EMSOFT '06

Publisher: ACM Press

Full text available: pdf(209.86 KB) Additional Information: full citation , abstract , references , index terms

Embedded applications like video decoding, video streaming and those in the network
domain, typically have a Quality of Service (QoS) requirement which needs to be met. 
Apart from being a design constraint, it can also be considered as a flexibility that the 
design does not have to work under worst case data condition. For example, in the case 
of video decoding with variable decoding time for individual frames, it may be adequate 
that only a fraction of frames (say 90%) needs to be decoded. In t ... 

Keywords: mapping, partitioning, process network, quality of service, soft real time 
constraints 



46 Optimizing for space and time usage with speculative partial redundancy elimination
Bernhard Scholz, Nigel Horspool, Jens Knoop

June 2004 ACM SIGPLAN Notices , Proceedings of the 2004 ACM SIGPLAN/SIGBED
conference on Languages, compilers, and tools for embedded systems
LCTES '04, Volume 39 Issue 7
Publisher: ACM Press 
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Speculative partial redundancy elimination (SPRE) uses execution profiles to improve the 
expected performance of programs. We show how the problem of placing expressions to 
achieve the optimal expected performance can be mapped to a particular kind of network 
flow problem and hence solved by well known techniques. Our solution is sufficiently 
efficient to be used in practice. Furthermore, the objective function may be chosen so that 
reduction in space requirements is the primary goal and executi ... 

Keywords: code motion, common subexpressions, partial redundancy, profile-guided 
optimization, speculation 



47 Multithreading and multiprocessing: A case study of multi-threading in the embedded 
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Greg Hoover, Forrest Brewer, Timothy Sherwood 



October 2006 Proceedings of the 2006 international conference on Compilers, 
architecture and synthesis for embedded systems CASES '06

Publisher: ACM Press

Full text available: pdf(1.41 MB) Additional Information: full citation , abstract , references , index terms

The continuing miniaturization of technology coupled with wireless networks has made it
feasible to physically embed sensor network systems into the environment. Sensor net
processors are tasked with the job of handling a disparate set of interrupt driven activity,
from networks to timers to the sensors themselves. In this paper, we demonstrate the
advantages of a tiny multi-threaded microcontroller design which targets embedded 
applications that need to respond to events at high speed. While mult ... 

Keywords: embedded architecture, multi-threading 



48 Emerging areas: Programming challenges in network processor deployment
Chidamber Kulkarni, Matthias Gries, Christian Sauer, Kurt Keutzer

October 2003 Proceedings of the 2003 international conference on Compilers, 
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Publisher: ACM Press 
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Programming multi-processor ASIPs, such as network processors, remains an art due to 
the wide variety of architectures and due to little support for exploring different 
implementation alternatives. We present a study that implements an IP forwarding router 
application on two different network processors to better understand the main challenges 
in programming such multi-processor ASIPs. The goal of this study is to identify the
elements central to a successful deployment of such systems based on ... 

Keywords: IPv4 forwarding, mapping, multi-threading, programming heterogeneous 
architectures, programming model, resource sharing 



49 Taming the IXP network processor 

Lal George, Matthias Blume
May 2003 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 2003 conference
on Programming language design and implementation PLDI '03, Volume 38 Issue 5

Publisher: ACM Press 
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We compile Nova, a new language designed for writing network processing applications, 
using a back end based on integer-linear programming (ILP) for register allocation, 
optimal bank assignment, and spills. The compiler's optimizer employs CPS as its 
intermediate representation; some of the invariants that this IR guarantees are essential
for the formulation of a practical ILP model. Appel and George used a similar ILP-based 
technique for the IA32 to decide which variables reside in registers but ... 

Keywords: Intel IXA, bank assignment, code generation, integer linear programming, 
network processors, programming languages, register allocation 



50 Tuning compiler optimizations for simultaneous multithreading 
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Compiler optimizations are often driven by specific assumptions about the underlying 
architecture and implementation of the target machine. For example, when targeting
shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in
order to decrease high-cost, inter-processor communication. This paper reexamines
several compiler optimizations in the context of simultaneous multithreading (SMT), a 
processor architecture that issues instructions from multiple threads to the f ... 

Keywords: cache size, compiler optimizations, cyclic algorithm, fine-grained sharing, 
instructions, inter-processor communication, inter-thread instruction-level parallelism, 
latency hiding, loop tiling, loop-iteration scheduling, memory system resources, 
optimising compilers, parallel architecture, parallel programs, performance, processor 
architecture, shared-memory multiprocessors, simultaneous multithreading, software
speculative execution
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X1

Christian Bell, Wei-Yu Chen, Dan Bonachea, Katherine Yelick

June 2004 Proceedings of the 18th annual international conference on 

Supercomputing ICS '04 
Publisher: ACM Press
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terms

The Cray X1 was recently introduced as the first in a new line of parallel systems to
combine high-bandwidth vector processing with an MPP system architecture. Alongside
capabilities such as automatic fine-grained data parallelism through the use of vector
instructions, the X1 offers hardware support for a transparent global-address space
(GAS), which makes it an interesting target for GAS languages. In this paper, we describe
our experience with developing a portable, open-source and high perfo ...

Keywords: UPC, X1, global address space



52 Compilation: Efficient spill code for SDRAM 
V. Krishna Nandivada, Jens Palsberg

October 2003 Proceedings of the 2003 international conference on Compilers, 

architecture and synthesis for embedded systems CASES '03 
Publisher: ACM Press 

Additional Information: full citation , abstract , references , citings , index 
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Processors such as StrongARM and memory such as SDRAM enable efficient execution of 
multiple loads and stores in a single instruction. This is particularly useful in connection 
with register allocation where spill code may need to save and restore multiple registers. 
Until now, there has been no effective strategy for utilizing this to its full potential. In this 
paper we investigate the use of SDRAM for optimization of spill code. The core of the 
problem is to arrange the variables in the spill ... 

Keywords: SDRAM, integer linear programming, memory layout, optimization 



53 Software thread integration for embedded system display applications 
Alexander G. Dean 

February 2006 ACM Transactions on Embedded Computing Systems (TECS), Volume 5 Issue 1

Publisher: ACM Press

Full text available: pdf(1.40 MB) Additional Information: full citation , abstract , references , index terms

Embedded systems require control of many concurrent real-time activities, leading to
system designs that feature a variety of hardware peripherals, with each providing a 
specific, dedicated service. These peripherals increase system size, cost, weight, and 
design time. Software thread integration (STI) provides low-cost thread concurrency on 
general-purpose processors by automatically interleaving multiple threads of control into 
one. This simplifies hardware to software migration (which elimina ... 



Keywords: fine-grain concurrency, hardware to software migration, software thread 
integration 
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55 HPFBench: a high performance Fortran benchmark suite 
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The high performance Fortran (HPF) benchmark suite HPFBench is designed for evaluating 
the HPF language and compilers on scalable architectures. The functionality of the 
benchmarks covers scientific software library functions and application kernels that reflect 
the computational structure and communication patterns in fluid dynamic simulations, 
fundamental physics, and molecular studies in chemistry and biology. The benchmarks 
are characterized in terms of FLOP count, memory usage, communi ...
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56 Special issue: AI in engineering

D. Sriram, R. Joobbani
April 1985 ACM SIGART Bulletin, Issue 92
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The papers in this special issue were compiled from responses to the announcement in 
the July 1984 issue of the SIGART newsletter and notices posted over the ARPAnet. The 
interest being shown in this area is reflected in the sixty papers received from over six 
countries. About half the papers were received over the computer network. 

57 Meta optimization: improving compiler heuristics with machine learning 
Mark Stephenson, Saman Amarasinghe, Martin Martin, Una-May O'Reilly 

May 2003 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 2003 conference
on Programming language design and implementation PLDI '03, Volume 38
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Full text available: pdf(302.23 KB)

terms

Compiler writers have crafted many heuristics over the years to approximately solve NP-
hard problems efficiently. Finding a heuristic that performs well on a broad range of
applications is a tedious and difficult process. This paper introduces Meta Optimization, a 
methodology for automatically fine-tuning compiler heuristics. Meta Optimization uses 
machine-learning techniques to automatically search the space of compiler heuristics. Our 
techniques reduce compiler design complexity by relieving c ... 

Keywords: compiler heuristics, genetic programming, machine learning, priority 
functions 
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59 Fast and efficient searches for effective optimization-phase sequences 
Prasad A. Kulkarni, Stephen R. Hines, David B. Whalley, Jason D. Hiser, Jack W. Davidson,
Douglas L. Jones 

June 2005 ACM Transactions on Architecture and Code Optimization (TACO), volume 2 
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It has long been known that a fixed ordering of optimization phases will not produce the 
best code for every application. One approach for addressing this phase-ordering problem 
is to use an evolutionary algorithm to search for a specific sequence of phases for each 
module or function. While such searches have been shown to produce more efficient code, 
the approach can be extremely slow because the application is compiled and possibly 
executed to evaluate each sequence's effectiveness. Consequen ... 

Keywords: Phase ordering, genetic algorithms, interactive compilation
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61 Efficient and correct execution of parallel programs that share memory
Dennis Shasha, Marc Snir

April 1988 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 10 Issue 2

Publisher: ACM Press
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Full text available: pdf(2.55 MB)



In this paper we consider an optimization problem that arises in the execution of parallel
programs on shared-memory multiple-instruction-stream, multiple-data-stream (MIMD) 
computers. A program on such machines consists of many sequential program segments, 
each executed by a single processor. These segments interact as they access shared 
variables. Access to memory is asynchronous, and memory accesses are not necessarily 
executed in the order they were issued. An execution is correct if it ... 

62 Multigrain shared memory

Donald Yeung, John Kubiatowicz, Anant Agarwal

May 2000 ACM Transactions on Computer Systems (TOCS), Volume 18 Issue 2
Publisher: ACM Press

Additional Information: full citation , abstract , references , citings , index
terms , review

Full text available: pdf(369.18 KB)



Parallel workstations, each comprising tens of processors based on shared memory, 
promise cost-effective scalable multiprocessing. This article explores the coupling of such 
small- to medium-scale shared-memory multiprocessors through software over a local 
area network to synthesize larger shared-memory systems. We call these systems
Distributed Shared-memory Multiprocessors (DSMPs). This article introduces the design of 
a shared-memory system that uses multiple granularities of sharing, ca ...
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63 Compiler and runtime support for efficient software transactional memory
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Vijay Menon, Brian R. Murphy, Bratin Saha, Tatiana
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June 2006 ACM SIGPLAN Notices , Proceedings of the 2006 ACM SIGPLAN conference 
on Programming language design and implementation PLDI '06, Volume 41
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Programmers have traditionally used locks to synchronize concurrent access to shared
data. Lock-based synchronization, however, has well-known pitfalls: using locks for fine-
grain synchronization and composing code that already uses locks are both difficult and
prone to deadlock. Transactional memory provides an alternate concurrency control
mechanism that avoids these pitfalls and significantly eases concurrent programming.
Transactional memory language constructs have recently been proposed as ... 

Keywords: code generation, compiler optimizations, locking, synchronization, 
transactional memory, virtual machines 



64 Instruction fetch and control flow: Reducing control overhead in dataflow
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Andrew Petersen, Andrew Putnam, Martha Mercaldi, Andrew Schwerin, Susan Eggers, Steve
Swanson, Mark Oskin 

September 2006 Proceedings of the 15th international conference on Parallel 
architectures and compilation techniques PACT '06 

Publisher: ACM Press 

Full text available: pdf(662.72 KB) Additional Information: full citation , abstract , references , index terms

In recent years, computer architects have proposed tiled architectures in response to
several emerging problems in processor design, such as design complexity, wire delay,
and fabrication reliability. One of these architectures, WaveScalar, uses a dynamic,
tagged-token dataflow execution model to simplify the design of the processor tiles and
their interconnection network and to achieve good parallel performance. However, using a 
dataflow execution model reawakens old problems, including the ins ... 

Keywords: Wavescalar, compiler, dataflow, tiled architecture 



65 Support for High-Frequency Streaming in CMPs

Ram Rangan, Neil Vachharajani, Adam Stoler, Guilherme Ottoni, David I. August, George Z. 
N. Cai 

December 2006 Proceedings of the 39th Annual IEEE/ACM International Symposium 
on Microarchitecture MICRO 39 

Publisher: IEEE Computer Society 

Full text available: pdf(283.43 KB) Additional Information: full citation , abstract , index terms

As the industry moves toward larger-scale chip multiprocessors, the need to parallelize
applications grows. High inter-thread communication delays, exacerbated by over-
stressed high-latency memory subsystems and ever-increasing wire delays, require
parallelization techniques to create partially or fully independent threads to improve 
performance. Unfortunately, developers and compilers alike often fail to find sufficient 
independent work of this kind. Recently proposed pipelined streaming techni ... 

66 Measurement and evaluation of the MIPS architecture and processor 
Thomas R. Gross, John L. Hennessy, Steven A. Przybylski, Christopher Rowen 

August 1988 ACM Transactions on Computer Systems (TOCS), Volume 6 Issue 3

Publisher: ACM Press 
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MIPS is a 32-bit processor architecture that has been implemented as an nMOS VLSI chip. 
The instruction set architecture is RISC-based. Close coupling with compilers and efficient 
use of the instruction set by compiled programs were goals of the architecture. The MIPS 
architecture requires that the software implement some constraints in the design that are 
normally considered part of the hardware implementation. This paper presents 
experimental results on the effectiveness of this processor ... 

67 Technical reports 

SIGACT News Staff 

January 1980 ACM SIGACT News, Volume 12 Issue 1
Publisher: ACM Press

Full text available: pdf(5.28 MB) Additional Information: full citation
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Automatic compilation to a coarse-grained reconfigurable system-on-chip
Girish Venkataramani, Walid Najjar, Fadi Kurdahi, Nader Bagherzadeh, Wim Bohm, Jeff 
Hammes 

November 2003 ACM Transactions on Embedded Computing Systems (TECS), volume 2 

Issue 4 

Publisher: ACM Press 

Full text available: pdf(687.52 KB) Additional Information: full citation , abstract , references , citings , index
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The rapid growth of device densities on silicon has made it feasible to deploy 
reconfigurable hardware as a highly parallel computing platform. However, one of the 
obstacles to the wider acceptance of this technology is its programmability. The
application needs to be programmed in hardware description languages or an assembly 
equivalent, whereas most application programmers are used to the algorithmic 
programming paradigm. SA-C has been proposed as an expression-oriented language 
designed to im ... 

Keywords: Reconfigurable computing, SIMD, compilers 



69 VISTA: VPO interactive system for tuning applications 

Prasad Kulkarni, Wankang Zhao, Stephen Hines, David Whalley, Xin Yuan, Robert van
Engelen, Kyle Gallivan, Jason Hiser, Jack Davidson, Baosheng Cai, Mark Bailey, Hwashin
Moon, Kyunghwan Cho, Yunheung Paek

November 2006 ACM Transactions on Embedded Computing Systems (TECS), volume 5 
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Publisher: ACM Press 

Full text available: pdf(4.01 MB) Additional Information: full citation , abstract , references , index terms

Software designers face many challenges when developing applications for embedded 
systems. One major challenge is meeting the conflicting constraints of speed, code size, 
and power consumption. Embedded application developers often resort to hand-coded 
assembly language to meet these constraints since traditional optimizing compiler 
technology is usually of little help in addressing this challenge. The results are software 
systems that are not portable, less robust, and more costly to develop an ... 

Keywords: User-directed code improvement, genetic algorithms, interactive compilation, 
phase ordering 



70 Compilation: Cluster assignment of global values for clustered VLIW processors 

Andrei Terechko, Erwan Le Thénaff, Henk Corporaal
October 2003 Proceedings of the 2003 international conference on Compilers,
architecture and synthesis for embedded systems CASES '03

Publisher: ACM Press

Full text available: pdf(330.94 KB) Additional Information: full citation , abstract , references , index terms

In this paper high-level language (HLL) variables that are alive in a whole HLL function, 
across multiple scheduling units, are termed as global values. Due to their long live 
ranges and, hence, large impact on the schedule, the global values require different 
compiler optimizations than local values, which span across only one scheduling unit. The 
instruction scheduler for a clustered ILP processor, which is responsible for cluster 
assignment of operations and variables, faces a difficult probi ... 

Keywords: ILP, VLIW, cluster assignment, compiler, instruction scheduler, register 
allocation 



71 Helper threads via virtual multithreading on an experimental Itanium 2 processor-
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Perry H. Wang, Jamison D. Collins, Hong Wang, Dongkeun Kim, Bill Greene, Kai-Ming Chan,
Aamir B. Yunus, Terry Sych, Stephen F. Moore, John P. Shen 

October 2004 ACM SIGPLAN Notices , ACM SIGOPS Operating Systems Review , ACM 
SIGARCH Computer Architecture News , Proceedings of the 11th 
international conference on Architectural support for programming
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Publisher: ACM Press 
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Helper threading is a technology to accelerate a program by exploiting a processor's 
multithreading capability to run "assist" threads. Previous experiments on hyper-
threaded processors have demonstrated significant speedups by using helper threads to
prefetch hard-to-predict delinquent data accesses. In order to apply this technique to 
processors that do not have built-in hardware support for multithreading, we introduce 
virtual multithreading (VMT), a novel form of switch-on-event user-level ... 

Keywords: DB2 database, PAL, cache miss prefetching, helper thread, Itanium processor, 
multithreading, switch-on-event 



72 A study of devirtualization techniques for a Java Just-In-Time compiler 

Kazuaki Ishizaki, Motohiro Kawahito, Toshiaki Yasue, Hideaki Komatsu, Toshio Nakatani 
October 2000 ACM SIGPLAN Notices , Proceedings of the 15th ACM SIGPLAN 
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applications OOPSLA '00, Volume 35 Issue 10 
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Many devirtualization techniques have been proposed to reduce the runtime overhead of 
dynamic method calls for various object-oriented languages, however, most of them are 
less effective or cannot be applied for Java in a straightforward manner. This is partly 
because Java is a statically-typed language and thus transforming a dynamic call to a 
static one does not make a tangible performance gain (owing to the low overhead of 
accessing the method table) unless it is inlined, and partly because t ... 



73 Code and data layout optimisations for embedded software: An interprocedural code 
optimization technique for network processors using hardware multi-threading 
support 

Hanno Scharwaechter, Manuel Hohenauer, Rainer Leupers, Gerd Ascheid, Heinrich Meyr 
March 2006 Proceedings of the conference on Design, automation and test in Europe: 

Proceedings DATE '06 
Publisher: European Design and Automation Association 

Full text available: pdf(174.69 KB) Additional Information: full citation , abstract , references

Sophisticated C compiler support for network processors (NPUs) is required to improve 
their usability and consequently, their acceptance in system design. Nonetheless, high- 
level code compilation always introduces overhead, regarding code size and performance 
compared to handwritten assembly code. This overhead results partially from high-level 
function calls that usually introduce memory accesses in order to save and reload register 
contents. A key feature of many NPU architectures is hardware ...
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symposium on Memory management ISMM '00, Volume 36 Issue 1
Publisher: ACM Press
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This paper reports our experiences with a mostly-concurrent incremental garbage
collector, implemented in the context of a high performance virtual machine for the Java™
programming language. The garbage collector is based on the "mostly parallel" collection
algorithm of Boehm et al., and can be used as the old generation of a generational
memory system. It overloads efficient write-barrier code already generated to support 
generational garbage collection to also ident ... 



75 Exploiting coarse-grained task, data, and pipeline parallelism in stream programs 
Michael I. Gordon, William Thies, Saman Amarasinghe 



October 2006 ACM SIGOPS Operating Systems Review , ACM SIGARCH Computer 
Architecture News , ACM SIGPLAN Notices , Proceedings of the 12th 
international conference on Architectural support for programm ing 
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Publisher: ACM Press 
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As multicore architectures enter the mainstream, there is a pressing demand for high-level
programming models that can effectively map to them. Stream programming offers
an attractive way to expose coarse-grained parallelism, as streaming applications (image, 
video, DSP, etc.) are naturally represented by independent filters that communicate over 
explicit data channels. In this paper, we demonstrate an end-to-end stream compiler that
attains robust multicore performance in the face of varying app ... 

Keywords: Raw, StreamIt, coarse-grained dataflow, multicore, software pipelining,
streams 
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Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data 
(SPMD) model of parallelism within a global address space. The global address space is 
used to simplify programming, especially on applications with irregular data structures 
that lead to fine-grained sharing between threads. Recent results have shown that the 
performance of UPC using a commercial compiler is comparable to that of MPI [7]. In this 
paper we describe a portable open source compiler for UPC. Ou ... 
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This paper relates our experience in designing from scratch a multi-threaded kernel for a 
MIPS R3000 on-chip multiprocessor. We briefly present the target architecture built
around a VCI compliant interconnect, and the CPU characteristics. Then we focus on the 
implementation of part of the POSIX 1003.1b and 1003.1c standards. We conclude this 
case study by simulation results obtained by cycle true simulation of an MJPEG video 
decoder application on the multiprocessor, using several scheduler org ... 

78 A Performance Evaluation of the Convex SPP-1000 Scalable Shared Memory
Parallel Computer

Thomas Sterling, Daniel Savaresse, Peter MacNeice, Kevin Olson, Clark Mobarry, Bruce
Fryxell, Phillip Merkey 
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79 Runtime optimizations for a Java DSM implementation 
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Jackal is a fine-grained distributed shared memory implementation of the Java 
programming language. Jackal implements Java's memory model and allows 
multithreaded Java programs to run unmodified on distributed-memory systems.

This paper focuses on Jackal's runtime system, which implements a multiple-writer, 
home-based consistency protocol. Protocol actions are triggered by software access 
checks that Jackal's compiler inserts before object and array references. We describe 
optimizatio ... 

80 Fast searches for effective optimization phase sequences 
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June 2004 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 2004 conference
on Programming language design and implementation PLDI '04, volume 39 

Issue 6 

Publisher: ACM Press 

Additional Information: full citation , abstract , references , citings , index
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It has long been known that a fixed ordering of optimization phases will not produce the 
best code for every application. One approach for addressing this phase ordering problem 
is to use an evolutionary algorithm to search for a specific sequence of phases for each 
module or function. While such searches have been shown to produce more efficient code, 
the approach can be extremely slow because the application is compiled and executed to 
evaluate each sequence's effectiveness. Consequently, evol ... 
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81 Communication optimizations for parallel C programs
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This paper presents algorithms for reducing the communication overhead for parallel C 
programs that use dynamically-allocated data structures. The framework consists of an 
analysis phase called possible-placement analysis, and a transformation phase called
communication selection. The fundamental idea of possible-placement analysis is to find
all possible points for insertion of remote memory operations. Remote reads are 
propagated upwards, whereas remote writes are propagate ... 
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This paper explores compiler techniques for reducing the memory needed to load and run 
program executables. In embedded systems, where economic incentives to reduce both
RAM and ROM are strong, the size of compiled code is increasingly important. Similarly, in 
mobile and network computing, the need to transmit an executable before running it 
places a premium on code size. Our work focuses on reducing the size of a program's 
code segment, using pattern-matching techniques to identify and coalesce ... 

The elements of nature: interactive and realistic techniques
Oliver Deusen, David S. Ebert, Ron Fedkiw, F. Kenton Musgrave, Przemyslaw Prusinkiewicz,
Doug Roble, Jos Stam, Jerry Tessendorf 
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This updated course on simulating natural phenomena will cover the latest research and 
production techniques for simulating most of the elements of nature. The presenters will 
provide movie production, interactive simulation, and research perspectives on the 
difficult task of photorealistic modeling, rendering, and animation of natural phenomena. 
The course offers a nice balance of the latest interactive graphics hardware-based
simulation techniques and the latest physics-based simulation techni ... 
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Product-line architectures (PLAs) are an emerging paradigm for developing software 
families for distributed real-time and embedded (DRE) systems by customizing reusable 
artifacts, rather than hand-crafting software from scratch. To reduce the effort of 
developing software PLAs and product variants for DRE systems, developers are applying 
general-purpose — ideally standard — middleware platforms whose reusable services and 
mechanisms support a range of application quality of service (QoS) requi ... 
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We describe the use of neural networks for optimization and inference associated with a 
variety of complex systems. We show how a string formalism can be used for parallel 
computer decomposition, message routing and sequential optimizing compilers. We 
extend these ideas to a general treatment of spatial assessment and distributed artificial 
intelligence. 
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Volume 1 Issue 3 

Publisher: ACM Press
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Dynamic voltage scaling (DVS) has become an important dynamic power-management 
technique to save energy. DVS tunes the power-performance tradeoff to the needs of the 
application. The goal is to minimize energy consumption while meeting performance 
needs. Since CPU power consumption is strongly dependent on the supply voltage, DVS 
exploits the ability to control the power consumption by varying a processor's supply 
voltage and clock frequency. However, because of the energy and time overhead asso ... 

Keywords: Analytical model, compiler, dynamic voltage scaling, low power, mixed- 
integer linear programming 
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February 2004 Proceedings of the conference on Design, automation and test in 
Europe - Volume 2 DATE '04 

Publisher: IEEE Computer Society 

Full text available: pdf(136.20 KB) Additional Information: full citation , abstract , index terms

In recent years, research on wireless sensor networks has been undergoing a revolution, 
promising to have significant impact on a broad range of applications from military to 
health care to food safety. An important problem in many sensor network applications is 
to decide the amount of computation (or filtering) that needs to be done in the sensor 
nodes before the data are shifted to a central base station. Right amount of data filtering 
in the sensor nodes can lead to large savings in network-w ...



Technical papers: MPI and communication— Software routing and aggregation of 

messages to optimize the performance of HPCC randomaccess benchmark 
Rahul Garg, Yogish Sabharwal

November 2006 Proceedings of the 2006 ACM/IEEE conference on Supercomputing SC 
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Publisher: ACM Press 
Full text available: pdf(158.36 KB)
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The HPC Challenge(HPCC) benchmark suite is increasingly being used to evaluate the 
performance of supercomputers. It augments the traditional LINPACK benchmark by
adding six more benchmarks, each designed to measure a specific aspect of the system
performance. In this paper, we analyze the HPCC Randomaccess benchmark which is
designed to measure the performance of random memory updates. We show that, on 
many systems, the bisection bandwidth of the network may be the performance 
bottleneck of this ... 

Keywords: Blue Gene/L, HPC challenge, benchmarks, high performance computing, 
Linpack, randomaccess, supercomputing



Embedded applications: Architectural optimizations for low-power, real-time speech 
recognition 

Rajeev Krishna, Scott Mahlke, Todd Austin

October 2003 Proceedings of the 2003 international conference on Compilers, 

architecture and synthesis for embedded systems CASES '03 
Publisher: ACM Press 

Full text available: pdf(634.03 KB) Additional Information: full citation , abstract , references , citings , index
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The proliferation of computing technology to low power domains such as hand-held
devices has led to increased interest in portable interface technologies, with particular
interest in speech recognition. The computational demands of robust, large vocabulary
speech recognition systems, however, are currently prohibitive for such low power
devices. This work begins an exploration of domain specific characteristics of speech
recognition that might be exploited to achieve the requisite performance w ... 

APRIL: a processor architecture for multiprocessing 
Anant Agarwal, Beng-Hong Lim, David Kranz, John Kubiatowicz 

May 1990 ACM SIGARCH Computer Architecture News , Proceedings of the 17th 

annual international symposium on Computer Architecture ISCA '90, volume 

18 Issue 3a 

Publisher: ACM Press 
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Processors in large-scale multiprocessors must be able to tolerate large communication 
latencies and synchronization delays. This paper describes the architecture of a rapid- 
context-switching processor called APRIL with support for fine-grain threads and 
synchronization. APRIL achieves high single-thread performance and supports virtual 
dynamic threads. A commercial RISC-based implementation of APRIL and a run-time 
software system that can switch contexts in about 10 cycles is described. Me ... 

Problem and machine sensitive communication optimization 
Thomas Fahringer, Eduard Mehofer 

July 1998 Proceedings of the 12th international conference on Supercomputing ICS 
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Publisher: ACM Press 
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A region-based compilation technique for dynamic compilers 
Toshio Suganuma, Toshiaki Yasue, Toshio Nakatani 



January 2006 ACM Transactions on Programming Languages and Systems (TOPLAS),
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Publisher: ACM Press 

Full text available: pdf(977.63 KB) Additional Information: full citation , abstract , references , index terms

Method inlining and data flow analysis are two major optimization components for 
effective program transformations, but they often suffer from the existence of rarely or 
never executed code contained in the target method. One major problem lies in the 
assumption that the compilation unit is partitioned at method boundaries. This article 
describes the design and implementation of a region-based compilation technique in our 
dynamic optimization framework, in which the compiled regions are selected ... 

Keywords: JIT compiler, Region-based compilation, dynamic compilation, on-stack 
replacement, partial inlining 

93 Sensor networks and performance analysis: Impact of virtual execution environments
on processor energy consumption and hardware adaptation
Shiwen Hu, Lizy K. John

June 2006 Proceedings of the second international conference on Virtual execution 
environments VEE '06 

Publisher: ACM Press 

Full text available: pdf(306.86 KB) Additional Information: full citation , abstract , references , index terms

During recent years, microprocessor energy consumption has been surging and efforts to 
reduce power and energy have received a lot of attention. At the same time, virtual 
execution environments (VEEs), such as Java virtual machines, have grown in popularity. 
Hence, it is important to evaluate the impact of virtual execution environments on 
microprocessor energy consumption. This paper characterizes the energy and power 
impact of two important components of VEEs, Just-In-Time (JIT) optimization an ...

Keywords: energy efficiency, hardware adaptation, power dissipation 
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Publisher: ACM Press 
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Multithreaded architectures are emerging as an important class of parallel machines. By 
allowing fast context switching between threads on the same processor, these systems 
hide communication and synchronization latencies and allow scalable parallelism for 
dynamic and irregular applications. Thread partitioning is the most important task in 
compiling high-level languages for multithreaded architectures. Non-preemptive 
multithreaded architectures, which can be built from off-the-shelf compon ... 

95 Reducing energy consumption of multiprocessor SoC architectures by exploiting
memory bank locality
Mahmut Taylan Kandemir

April 2006 ACM Transactions on Design Automation of Electronic Systems (TODAES), 

Volume 11 Issue 2 
Publisher: ACM Press 
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The next generation embedded architectures are expected to accommodate multiple 
processors on the same chip. While this makes interprocessor communication less costly 
as compared to traditional high-end parallel machines, it also makes off-chip requests 
very costly. In particular, frequent off-chip memory accesses do not only increase
execution cycles but also increase overall power consumption. One way of alleviating this 
power problem is to divide the off-chip memory into multiple banks, each ... 
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An annotation language for optimizing software libraries 
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Publisher: ACM Press
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This paper introduces an annotation language and a compiler that together can customize 
a library implementation for specific application needs. Our approach is distinguished by 
its ability to exploit high level, domain-specific information in the customization process. 
In particular, the annotations provide semantic information that enables our compiler to 
analyze and optimize library operations as if they were primitives of a domain-specific
language. Thus, our approach yields many of the ... 

97 OpenMP on networks of workstations
Honghui Lu, Y. Charlie Hu, Willy Zwaenepoel

November 1998 Proceedings of the 1998 ACM/IEEE conference on Supercomputing 
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We describe an implementation of a sizable subset of OpenMP on networks of
workstations (NOWs). By extending the availability of OpenMP to NOWs, we overcome 
one of its primary drawbacks compared to MPI, namely lack of portability to environments 
other than hardware shared memory machines. In order to support OpenMP execution on 
NOWs, our compiler targets a software distributed shared memory system (DSM) which 
provides multi-threaded execution and memory consistency. This paper presents two
contri ... 
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November 1994 Proceedings of the 1994 ACM/IEEE conference on Supercomputing

Supercomputing '94 
Publisher: ACM Press 

Full text available: pdf(1.08 MB) Additional Information: full citation , abstract , references

In adaptive irregular problems, data arrays are accessed via indirection arrays, and data 
access patterns change during computation. Parallelizing such problems on distributed 
memory machines requires support for dynamic data partitioning, efficient preprocessing 
and fast data migration. This paper describes CHAOS, a library of efficient runtime 
primitives that provides such support. To demonstrate the effectiveness of the runtime 
support, two adaptive irregular applications have been paralleliz ...

99 Process migration 
Dejan S. Milojicic, Fred Douglis, Yves Paindaveine, Richard Wheeler, Songnian Zhou
September 2000 ACM Computing Surveys (CSUR), Volume 32 Issue 3
Publisher: ACM Press 
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Full text available: pdf(124 MB)

terms , review

Process migration is the act of transferring a process between two machines. It enables 
dynamic load distribution, fault resilience, eased system administration, and data access 
locality. Despite these goals and ongoing research efforts, migration has not achieved 
widespread use. With the increasing deployment of distributed systems in general, and 
distributed operating systems in particular, process migration is again receiving more 
attention in both research and product development. As hi ... 



Keywords: distributed operating systems, distributed systems, load distribution, process 
migration 
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101 Automatic Thread Extraction with Decoupled Software Pipelining
Guilherme Ottoni, Ram Rangan, Adam Stoler, David I. August 

November 2005 Proceedings of the 38th annual IEEE/ACM International Symposium 

on Microarchitecture MICRO 38 

Publisher: IEEE Computer Society 

Full text available: pdf(538.89 KB)
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Until recently, a steadily rising clock rate and other uniprocessor microarchitectural 
improvements could be relied upon to consistently deliver increasing performance for a 
wide range of applications. Current difficulties in maintaining this trend have led
microprocessor manufacturers to add value by incorporating multiple processors on a 
chip. Unfortunately, since decades of compiler research have not succeeded in delivering 
automatic threading for prevalent code properties, this approach dem ... 

102 Scientific Computations on Modern Parallel Vector Systems
Leonid Oliker, Andrew Canning, Jonathan Carter, John Shalf, Stephane Ethier

November 2004 Proceedings of the 2004 ACM/IEEE conference on Supercomputing SC
'04

Publisher: IEEE Computer Society 

Full text available: pdf(239.19 KB) Additional Information: full citation , abstract , citings

Computational scientists have seen a frustrating trend of stagnating application 
performance despite dramatic increases in the claimed peak capability of high 
performance computing systems. This trend has been widely attributed to the use of 
superscalar-based commodity components whose architectural designs offer a balance
between memory performance, network capability, and execution rate that is poorly 
matched to the requirements of large-scale numerical computations. Recently, two 
innovative p ... 

A pipelined memory architecture for high throughput network processors
Timothy Sherwood, George Varghese, Brad Calder
May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th

annual international symposium on Computer architecture ISCA '03, volume 

31 Issue 2 

Publisher: ACM Press 

Full text available: pdf(213.66 KB) Additional Information: full citation , abstract , references , citings

Designing ASICs for each new generation of backbone routers is a time intensive and 
fiscally draining process. In this paper we focus on the design of a programmable 
architecture for backbone routers, based on the manipulation of wide irregular memory 
words, that can provide a feasible design alternative to custom ASICs. We propose a 
pipelined memory design that emphasizes worst-case throughput over latency, and co- 
explore architectural tradeoffs with the design of several important network algo ... 



Software implementation strategies for power-conscious systems 
Kshirasagar Naik, David S. L. Wei 

June 2001 Mobile Networks and Applications, volume 6 issue 3 
Publisher: Kluwer Academic Publishers 
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A variety of systems with possibly embedded computing power, such as small portable 
robots, hand-held computers, and automated vehicles, have power supply constraints. 
Their batteries generally last only for a few hours before being replaced or recharged. It is 
important that all design efforts are made to conserve power in those systems. Energy 
consumption in a system can be reduced using a number of techniques, such as low-power
electronics, architecture-level power reduction, compiler te ...

Keywords: energy saving, low power system, software design and implementation 
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Although there has been some experimentation with Java as a language for numerically 
intensive computing, there is a perception by many that the language is unsuited for such
work because of performance deficiencies. In this article we show how optimizing array 
bounds checks and null pointer checks creates loop nests on which aggressive 
optimizations can be used. Applying these optimizations by hand to a simple matrix- 
multiply test case leads to Java-compliant programs whose performance is ... 

Keywords: arrays, compilers, Java 
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November 2004 Proceedings of the 2004 ACM/IEEE conference on Supercomputing SC 
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The Merrimac supercomputer uses stream processors and a high-radix network to achieve 
high performance at low cost and low power. The stream architecture matches the 
capabilities of modern semiconductor technology with compute-intensive parallel
applications. We present a detailed case study of porting the GROMACS molecular- 
dynamics force calculation to Merrimac. The characteristics of the architecture which 
stress locality, parallelism, and decoupling of memory operations and computation, 
allow ... 

108 A fast Fourier transform compiler 
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The FFTW library for computing the discrete Fourier transform (DFT) has gained a wide
acceptance in both academia and industry, because it provides excellent performance on 
a variety of machines (even competitive with or faster than equivalent libraries supplied 
by vendors). In FFTW, most of the performance-critical code was generated automatically 
by a special-purpose compiler, called genfft, that outputs C code. Written in Objective
Caml, genfft can produce DFT programs for any input length, a ...
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Climate modeling is a grand challenge problem where scientific progress is measured not 
in terms of the largest problem that can be solved but by the highest achievable 
integration rate. These models have been notably absent in previous Gordon Bell 
competitions due to their inability to scale to large processor counts. A scalable and 
efficient spectral element atmospheric model is presented. A new semi-implicit time 
stepping scheme accelerates the integration rate relative to an explicit model b ... 
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November 2005 ACM SIGAda Ada Letters , Proceedings of the 2005 annual ACM
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The National Ignition Facility (NIF), currently under construction at the Lawrence 
Livermore National Laboratory, is a stadium-sized facility containing a 192-beam, 1.8 
Megajoule, 500-Terawatt, ultra-violet laser system together with a 10-meter diameter 
target chamber with room for nearly 100 experimental diagnostics. When completed, NIF 
will be the world's largest and most energetic laser experimental system, providing an 
international center to study inertial confinement fusion and physics of ... 

Keywords: Ada95, CORBA, XML, architecture, concurrency, data driven, framework, 
Java, model-based, multi-threaded, state machine, workflow 
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Demand for programming environments to exploit clusters of symmetric multiprocessors 
(SMPs) is increasing. In this paper, we present a new programming environment, called 
ParADE, to enable easy, portable, and high-performance programming on SMP clusters. It 
is an OpenMP programming environment on top of a multi-threaded software distributed 
shared memory (SDSM) system with a variant of home-based lazy release consistency 
protocol. To boost performance, the runtime system provides explicit messag ... 

Keywords: programming environment, SMP cluster, software distributed shared memory,
hybrid programming, OpenMP, MPI 
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Publisher: ACM Press 
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New applications and standards are first conceived only for functional correctness and 
without concerns for the target architecture. The next challenge is to map them onto an
architecture. Embedding such applications in a portable, low-energy context is the art of
molding it onto an energy-efficient target architecture combined with an energy efficient 
execution. With a reconfigurable architecture, this task becomes a two-way process where 
the architecture adapts to the application and vice-vers ...
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113 System & architectural-level power optimization: Selective code/data migration for
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Publisher: ACM Press 

Full text available: pdf(257.84 KB) Additional Information: full citation , abstract , references , index terms 

Proliferation of embedded on-chip multiprocessor architectures (MpSoC) motivates 
researchers from both academia and industry to consider optimization techniques for such
architectures. While many proposals on code/data partitioning on parallel architectures try
to minimize interprocessor communication requirements at runtime, many applications
still have significant runtime communication requirements. This paper proposes a novel 
task/data migration scheme that decides whether to migrate task or ... 
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The capability and heterogeneity of new FPGA (Field Programmable Gate Array) devices 
continues to increase with each new line of devices. Efficiently programming these devices 
is increasing in difficulty. However, FPGAs continue to be utilized for algorithms
traditionally targeted to embedded DSP microprocessors such as signal and image
processing applications. This paper presents an architecture that combines VLIW (Very
Large Instruction Word) processing with the capability to introduce applicat ... 
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It is vital to support concurrent applications sharing a wireless sensor network in order to 
reduce the deployment and administrative costs, thus increasing the usability and 
efficiency of the network. We describe Melete, a system that supports concurrent
applications with efficiency, reliability, flexibility, programmability, and scalability. Our
work is based on the Maté virtual machine [1] with significant modifications and
enhancements. Melete enables reliable storage and ... 
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The Java bytecode language is emerging as a software distribution standard. With major 
vendors committed to porting the Java run-time environment to their platforms, programs 
in Java bytecode are expected to run without modification on multiple platforms. These 
first generation run-time environments rely on an interpreter to bridge the gap between 
the bytecode instructions and the native hardware. This interpreter approach is sufficient 
for specialized applications such as Internet browsers wher ...
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Transactional Coherence and Consistency (TCC) provides a new parallel programming 
model that uses transactions as the basic unit of parallel work and communication. TCC 
simplifies the development of correct parallel code because hardware provides transaction 
atomicity and ordering. Nevertheless, the programmer or a dynamic compiler must still 
optimize the parallel code for performance. This paper presents TAPE, a hardware and
software infrastructure for profiling in TCC systems. TAPE extends the ...
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We present Sequoia, a programming language designed to facilitate the development of 
memory hierarchy aware parallel programs that remain portable across modern machines 
featuring different memory hierarchy configurations. Sequoia abstractly exposes 
hierarchical memory in the programming model and provides language mechanisms to 
describe communication vertically through the machine and to localize computation to 
particular memory locations within it. We have implemented a complete programming 
sy ... 
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We propose a speculative multi-threading processor architecture called Pinot. Pinot 
exploits parallelism over a wide range of granularities without modifying program sources. 
Since exploitation of fine-grain parallelism suffers from limits of parallelism and overhead 
incurred by parallelization, it is better to extract coarse-grain parallelism. Coarse-grain
parallelism is biased in some programs (mainly, numerical ones) and some program 
portions. Therefore, exploiting both coarse- and fine-grain ... 
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We present a compiler for machines with an explicitly managed memory hierarchy and 
suggest that a primary role of any compiler for such architectures is to manipulate and 
schedule a hierarchy of bulk operations at varying scales of the application and of the 
machine. We evaluate the performance of our compiler using several benchmarks running 
on a Cell processor. 
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121 Power awareness: Combining compiler and runtime IPC predictions to reduce energy 
in next generation architectures

Saurabh Chheda, Osman Unsal, Israel Koren, C. Mani Krishna, Csaba Andras Moritz
April 2004 Proceedings of the 1st conference on Computing frontiers CF '04 
Publisher: ACM Press 

Full text available: pdf(336.02 KB) Additional Information: full citation , abstract , references , index terms

Next generation architectures will require innovative solutions to reduce energy 
consumption. One of the trends we expect is more extensive utilization of compiler 
information directly targeting energy optimizations. As we show in this paper, static 
information provides some unique benefits, not available with runtime hardware-based
techniques alone. To achieve energy reduction, we use IPC information at various 
granularities, to adaptively adjust voltage and speed, and to throttle the fetch rat ... 

Keywords: adaptive voltage scaling, compiler architecture interaction, fetch throttling, 
instruction level parallelism, low power design 



122 A structural view of the Cedar programming environment 

Daniel C. Swinehart, Polle T. Zellweger, Richard J. Beach, Robert B. Hagmann
August 1986 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 8 Issue 4
Publisher: ACM Press

Full text available: pdf(6.32 MB) Additional Information: full citation , abstract , references , citings , index terms

This paper presents an overview of the Cedar programming environment, focusing on its 
overall structure— that is, the major components of Cedar and the way they are 
organized. Cedar supports the development of programs written in a single programming 
language, also called Cedar. Its primary purpose is to increase the productivity of 
programmers whose activities include experimental programming and the development of 
prototype software systems for a high-performance personal computer. T ... 



123 Generation and quantitative evaluation of dataflow clusters
Lucas Roh, Walid A. Najjar, A. P. Wim Bohm 

July 1993 Proceedings of the conference on Functional programming languages and 
computer architecture FPCA '93 

Publisher: ACM Press 
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Improved multithreading techniques for hiding communication latency in 
multiprocessors 



Bob Boothe, Abhiram Ranade
April 1992 ACM SIGARCH Computer Architecture News , Proceedings of the 19th
annual international symposium on Computer architecture ISCA '92, volume
20 Issue 2
Publisher: ACM Press

Full text available: pdf(1.13 MB) Additional Information: full citation , abstract , references , citings , index terms

Shared memory multiprocessors are considered among the easiest parallel computers to 
program. However building shared memory machines with thousands of processors has 
proved difficult because of the inevitably long memory latencies. Much previous research 
has focused on cache coherency techniques, but it remains unclear if caches can obtain 
sufficiently high hit rates. In this paper we present improved multithreading techniques 
that can easily tolerate latencies of hundreds of cycles, and y ... 

125 Performance comparison of ILP machines with cycle time evaluation

Tetsuya Hara, Hideki Ando, Chikako Nakanishi, Masao Nakaya
May 1996 ACM SIGARCH Computer Architecture News , Proceedings of the 23rd
annual international symposium on Computer architecture ISCA '96, volume
24 Issue 2
Publisher: ACM Press

Full text available: pdf Additional Information: full citation , abstract , references , citings , index terms

Many studies have investigated performance improvement through exploiting instruction-
level parallelism (ILP) with a particular architecture. Unfortunately, these studies indicate
performance improvement using the number of cycles that are required to execute a 
program, but do not quantitatively estimate the penalty imposed on the cycle time from 
the architecture. Since the performance of a microprocessor must be measured by its 
execution time, a cycle time evaluation is required as well as a cy ... 

126 Mobile and distributed systems: Enabling Java mobile computing on the IBM Jikes 
research virtual machine

Giacomo Cabri, Letizia Leonardi, Raffaele Quitadamo
August 2006 Proceedings of the 4th international symposium on Principles and
practice of programming in Java PPPJ '06
Publisher: ACM Press

Full text available: pdf(389.80 KB) Additional Information: full citation , abstract , references , index terms

Today's complex applications must face the distribution of data and code among different 
network nodes. Java is a wide-spread language that allows developers to build complex 
software, even distributed, but it cannot handle the migration of computations (i.e. 
threads), due to intrinsic limitations of many traditional JVMs. After analyzing the 
approaches in literature, this paper presents our research work on the IBM Jikes Research 
Virtual Machine: exploiting some of its innovative VM techniques, ... 

Keywords: Java virtual machine, code mobility, distributed applications, thread 
persistence 



127 Compile-time dynamic voltage scaling settings: opportunities and limits 
Fen Xie, Margaret Martonosi, Sharad Malik
May 2003 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 2003 conference
on Programming language design and implementation PLDI '03, volume 38
Issue 5
Publisher: ACM Press

Full text available: pdf(291.26 KB) Additional Information: full citation , abstract , references , citings , index terms

With power- related concerns becoming dominant aspects of hardware and software 
design, significant research effort has been devoted towards system power minimization. 
Among run-time power-management techniques, dynamic voltage scaling (DVS) has 
emerged as an important approach, with the ability to provide significant power savings.
DVS exploits the ability to control the power consumption by varying a processor's supply
voltage (V) and clock frequency (f). DVS controls energy by scheduling diffe ...



Keywords: analytical model, compiler, dynamic voltage scaling, low power, mixed-
integer linear programming



128 MGS: a multigrain shared memory system 
Donald Yeung, John Kubiatowicz, Anant Agarwal
May 1996 ACM SIGARCH Computer Architecture News , Proceedings of the 23rd
annual international symposium on Computer architecture ISCA '96, volume
24 Issue 2
Publisher: ACM Press

Full text available: pdf(1.37 MB) Additional Information: full citation , abstract , references , citings , index terms

Parallel workstations, each comprising 10-100 processors, promise cost-effective general- 
purpose multiprocessing. This paper explores the coupling of such small- to medium-scale 
shared memory multiprocessors through software over a local area network to synthesize 
larger shared memory systems. We call these systems Distributed Scalable Shared-
memory Multiprocessors (DSSMPs). This paper introduces the design of a shared memory
system that uses multiple granularities of sharing, and presents an imp ... 

129 Dynamic languages symposium chair's welcome: Runtime synthesis of high-
performance code from scripting languages

Christopher Mueller, Andrew Lumsdaine
October 2006 Companion to the 21st ACM SIGPLAN conference on Object-oriented
programming systems, languages, and applications OOPSLA '06
Publisher: ACM Press

Full text available: pdf(2.34 MB) Additional Information: full citation , abstract , references , index terms

Scripting languages are ubiquitous in modern software engineering and are often used as 
the sole language for application development. However, some applications, specifically 
scientific and multimedia applications, often have small sections of code that require a 
higher level of performance than the host language can deliver. In many cases, the 
algorithm being optimized is simple and has a clear mapping to hardware resources. But, 
without introducing an intermediate language, developers general ... 

Keywords: Python, SIMD, chemical fingerprint, machine code, meta-programming, 
synthetic programming 

130 The Berkeley software MPEG-1 video decoder
Ketan Mayer-Patel, Brian C. Smith, Lawrence A. Rowe
February 2005 ACM Transactions on Multimedia Computing, Communications, and
Applications (TOMCCAP), Volume 1 Issue 1
Publisher: ACM Press

Full text available: pdf Additional Information: full citation , abstract , references , index terms

This article reprises the description of the Berkeley software-only MPEG-1 video decoder 
originally published in the proceedings of the 1st International ACM Conference on 
Multimedia in 1993. The software subsequently became widely used in a variety of 
research systems and commercial products. Its main impact was to provide a platform for 
experimenting with streaming compressed video and to expose the strengths and 
weaknesses of software-only video decoding using general purpose computing archit ... 

Keywords: MPEG, Video compression 



131 Finding effective optimization phase sequences 

Prasad Kulkarni, Wankang Zhao, Hwashin Moon, Kyunghwan Cho, David Whalley, Jack
Davidson, Mark Bailey, Yunheung Paek, Kyle Gallivan
June 2003 ACM SIGPLAN Notices , Proceedings of the 2003 ACM SIGPLAN conference
on Language, compiler, and tool for embedded systems LCTES '03, volume
38 Issue ?
Publisher: ACM Press

Full text available: pdf(703.12 KB) Additional Information: full citation , abstract , references , citings , index terms



It has long been known that a single ordering of optimization phases will not produce the 
best code for every application. This phase ordering problem can be more severe when 
generating code for embedded systems due to the need to meet conflicting constraints on 
time, code size, and power consumption. Given that many embedded application 
developers are willing to spend time tuning an application, we believe a viable approach is 
to allow the developer to steer the process of optimizing a function ... 

Keywords: genetic algorithms, interactive compilation, phase ordering 



132 A region-based compilation technique for a Java just-in-time compiler
Toshio Suganuma, Toshiaki Yasue, Toshio Nakatani
May 2003 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 2003 conference
on Programming language design and implementation PLDI '03, volume 38
Issue 5
Publisher: ACM Press

Full text available: pdf Additional Information: full citation , abstract , references , citings , index terms

Method inlining and data flow analysis are two major optimization components for
effective program transformations; however, they often suffer from the existence of rarely
or never executed code contained in the target method. One major problem lies in the
assumption that the compilation unit is partitioned at method boundaries. This paper 
describes the design and implementation of a region-based compilation technique in our 
dynamic compilation system, in which the compiled regions are selected a ... 

Keywords: dynamic compilers, on-stack replacement, partial inlining, region-based 
compilation 



133 Scientific computing on the Itanium™ processor

Bruce Greer, John Harrison, Greg Henry, Wei Li, Peter Tang
November 2001 Proceedings of the 2001 ACM/IEEE conference on Supercomputing
(CDROM) Supercomputing '01
(CDROM) Supercomputing '01 

Publisher: ACM Press 

Full text available: pdf(177.31 KB) Additional Information: full citation , abstract , references , index terms 

The 64-bit Intel® Itanium™ architecture is designed for high-performance scientific and
enterprise computing, and the Itanium processor is its first silicon implementation.
Features such as extensive arithmetic support, predication, speculation, and explicit
parallelism can be used to provide a sound infrastructure for supercomputing. A large
number of high-performance computer companies are offering Itanium™-based systems,
some capable of peak performance exceeding 50 GFLOPS. In ... 

Keywords: EPIC, fused multiply-add, Itanium (TM) processor, linear algebra, 
transcendental functions 



134 Multimedia and graphics: Enhancing loop buffering of media and telecommunications
applications using low-overhead predication
John W. Sias, Hillery C. Hunter, Wen-mei W. Hwu
December 2001 Proceedings of the 34th annual ACM/IEEE international symposium
on Microarchitecture MICRO 34
Publisher: IEEE Computer Society

Full text available: pdf(1.37 MB) Publisher Site Additional Information: full citation , abstract , references , citings

Media- and telecommunications-focused processors, increasingly designed as deeply 
pipelined, statically-scheduled VLIWs, rely on loop buffers for low-overhead execution of
simple loops. Key loops containing control flow pose a substantial problem— full 
predication has a high encoding overhead, and partial predication techniques do not 
support if-conversion, the transformation of general acyclic control flow into predicated 
blocks. Using a set of significant media processing benchmarks, drawn fr ... 



135 Formalizing the safety of Java, the Java virtual machine, and Java card 
Pieter H. Hartel, Luc Moreau
December 2001 ACM Computing Surveys (CSUR), volume 33 issue 4
Publisher: ACM Press

Full text available: pdf(442.86 KB) Additional Information: full citation , abstract , references , citings , index terms

We review the existing literature on Java safety, emphasizing formal approaches, and the
impact of Java safety on small footprint devices such as smartcards. The conclusion is 
that although a lot of good work has been done, a more concerted effort is needed to 
build a coherent set of machine-readable formal models of the whole of Java and its 
implementation. This is a formidable task but we believe it is essential to build trust in 
Java safety, and thence to achieve ITSEC level 6 or Common Crite ... 

Keywords: Common criteria, programming 



136 A simple method for extracting models for protocol code 

David Lie, Andy Chou, Dawson Engler, David L Dill 
May 2001 ACM SIGARCH Computer Architecture News , Proceedings of the 28th
annual international symposium on Computer architecture ISCA '01, volume
29 Issue 2
Publisher: ACM Press

Additional Information: full citation , abstract , references , citings , index terms

The use of model checking for validation requires that models of the underlying system be 
created. Creating such models is both difficult and error prone and as a result, verification 
is rarely used despite its advantages. In this paper, we present a method for 
automatically extracting models from low level software implementations. Our method is 
based on the use of an extensible compiler system, xg++, to perform the extraction. The 
extracted model is combined with a model of the ha ... 

137 Area-Performance trade-offs in Tiled Dataflow Architectures

Steven Swanson, Andrew Putnam, Martha Mercaldi, Ken Michelson, Andrew Petersen,
Andrew Schwerin, Mark Oskin, Susan J. Eggers
May 2006 ACM SIGARCH Computer Architecture News , Proceedings of the 33rd
annual international symposium on Computer Architecture ISCA '06, volume
34 Issue 2
Publisher: IEEE Computer Society, ACM Press

Full text available: pdf(487.22 KB) Additional Information: full citation , abstract , citings , index terms

Tiled architectures, such as RAW, SmartMemories, TRIPS, and WaveScalar, promise to
address several issues facing conventional processors, including complexity, wire-delay, 
and performance. The basic premise of these architectures is that larger, higher- 
performance implementations can be constructed by replicating the basic tile across the 
chip. This paper explores the area-performance trade-offs when designing one such tiled 
architecture, WaveScalar. We use a synthesizable RTL model and cycle-le ... 

Keywords: WaveScalar, Dataflow computing, ASIC, RTL 



138 An efficient implementation of Java's remote method invocation 

Jason Maassen, Rob van Nieuwpoort, Ronald Veldema, Henri E. Bal, Aske Plaat
May 1999 ACM SIGPLAN Notices , Proceedings of the seventh ACM SIGPLAN
symposium on Principles and practice of parallel programming PPoPP '99,
Volume 34 Issue 8

Publisher: ACM Press 
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Java offers interesting opportunities for parallel computing. In particular, Java Remote 
Method Invocation provides an unusually flexible kind of Remote Procedure Call. Unlike 
RPC, RMI supports polymorphism, which requires the system to be able to download
remote classes into a running application. Sun's RMI implementation achieves this kind of 
flexibility by passing around object type information and processing it at run time, which 
causes a major run time overhead. Using Sun's JDK 1.1.4 on a P ... 



139 Synthesis and Design Tools: A compiler framework for mapping applications to a
coarse-grained reconfigurable computer architecture

Girish Venkataramani, Walid Najjar, Fadi Kurdahi, Nader Bagherzadeh, Wim Bohm
November 2001 Proceedings of the 2001 international conference on Compilers,
architecture, and synthesis for embedded systems CASES '01
Publisher: ACM Press

Full text available: pdf(304.22 KB) Additional Information: full citation , abstract , references , citings , index terms

The rapid growth of silicon densities has made it feasible to deploy reconfigurable 
hardware as a highly parallel computing platform. However, in most cases, the application 
needs to be programmed in hardware description or assembly languages, whereas most 
application programmers are familiar with the algorithmic programming paradigm. SA-C
has been proposed as an expression-oriented language designed to implicitly express data 
parallel operations. Morphosys is a reconfigurable system-on-chip arc ... 

140 The STAMPede approach to thread-level speculation
J. Gregory Steffan, Christopher Colohan, Antonia Zhai, Todd C. Mowry
August 2005 ACM Transactions on Computer Systems (TOCS), volume 23 issue 3
Publisher: ACM Press

Full text available: pdf(1.72 MB) Additional Information: full citation , abstract , references , citings , index terms

Multithreaded processor architectures are becoming increasingly commonplace: many
current and upcoming designs support chip multiprocessing, simultaneous multithreading,
or both. While it is relatively straightforward to use these architectures to improve the 
throughput of a multithreaded or multiprogrammed workload, the real challenge is how to 
easily create parallel software to allow single programs to effectively exploit all of this raw 
performance potential. One promising technique fo ... 

Keywords: Thread-level speculation, automatic parallelization, cache coherence, chip-
multiprocessing

Thread partitioning and scheduling based on cost model
Xinan Tang, J. Wang, Kevin B. Theobald, Guang R. Gao 

June 1997 Proceedings of the ninth annual ACM symposium on Parallel algorithms 
and architectures SPAA '97 

Publisher: ACM Press

How "hard" is thread partitioning and how "bad" is a list scheduling based partitioning
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Xinan Tang, Guang R. Gao 

June 1998 Proceedings of the tenth annual ACM symposium on Parallel algorithms 
and architectures SPAA '98 

Publisher: ACM Press

Thread partitioning method for hardware compiler Bach
Mizuki Takahashi, Nagisa Ishiura, Akihisa Yamada, Takashi Kambe
January 2000 Proceedings of the 2000 conference on Asia South Pacific design
automation ASP-DAC '00
Publisher: ACM Press

Case studies in embedded system design: A detailed cost model for concurrent use
with hardware/software co-design
Daniel Ragan, Peter Sandborn, Paul Stoaks 

June 2002 Proceedings of the 39th conference on Design automation DAC '02 

Publisher: ACM Press 

Full text available: pdf(278.42 KB) Additional Information: full citation , abstract , references , citings , index terms

Hardware/software co-design methodologies generally focus on the prediction of system
performance or co-verification of system functionality. This study extends this
conventional focus through the development of a methodology and software tool that
evaluates system (hardware and software) development, fabrication, and testing costs
(dollar costs) concurrent with hardware/software partitioning in a co-design
environment. Based on the determination of key metrics such as gate count and lines of 
sof ... 
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Models and languages for parallel computation 
David B. Skillicorn, Domenico Talia 

June 1998 ACM Computing Surveys (CSUR), volume 30 issue 2 
Publisher: ACM Press 
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We survey parallel programming models and languages using six criteria to assess their 
suitability for realistic portable parallel programming. We argue that an ideal model 
should be easy to program, should have a software development methodology, should be
architecture-independent, should be easy to understand, should guarantee performance, 
and should provide accurate information about the cost of programs. These criteria reflect 
our belief that developments in parallelism must be driven b ... 

Keywords: general-purpose parallel computation, logic programming languages, object- 
oriented languages, parallel programming languages, parallel programming models, 
software development methods, taxonomy 
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Steve S.W. Liao, Perry H. Wang, Hong Wang, Gerolf Hoflehner, Daniel Lavery, John P. Shen 
May 2002 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 2002 Conference
on Programming language design and implementation PLDI '02, Volume 37
Issue 5
Publisher: ACM Press 
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terms 

Recently, a number of thread-based prefetching techniques have been proposed. These
techniques aim at improving the latency of single-threaded applications by leveraging
multithreading resources to perform memory prefetching via speculative prefetch threads.
Software-based speculative precomputation (SSP) is one such technique, proposed for
multithreaded Itanium models. SSP does not require expensive hardware support; instead
it relies on the compiler to adapt binaries to perform prefetching on o ...

Keywords: chaining speculative precomputation, delay minimization, dependence 
reduction, long-range thread-based prefetching, loop rotation, pointer, post-pass, 
prediction, scheduling, slack, slicing, speculation, triggering
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May 2000 Proceedings of the 14th international conference on Supercomputing ICS 
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Publisher: ACM Press 
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Multithreaded architectures are emerging as an important class of parallel machines. By 
allowing fast context switching between threads on the same processor, these systems 
hide communication and synchronization latencies and allow scalable parallelism for
dynamic and irregular applications. Thread partitioning is the most important task in 
compiling high-level languages for multithreaded architectures. Non-preemptive 
multithreaded architectures, which can be built from off-the-shelf compon ... 
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Parallel computing is increasingly important in the solution of large-scale numerical 
problems. The difficulty of efficiently hand-coding parallelism, and the limitations of 
parallelizing compilers, have nonetheless restricted its use by scientific programmers. In
this paper we propose a new paradigm, chores, for the run-time support of parallel 
computing on shared-memory multiprocessors. We consider specifically uniform memory 
access shared-memory environments, ... 

Helper threads via virtual multithreading on an experimental Itanium® 2 processor-
based platform

Perry H. Wang, Jamison D. Collins, Hong Wang, Dongkeun Kim, Bill Greene, Kai-Ming Chan,
Aamir B. Yunus, Terry Sych, Stephen F. Moore, John P. Shen

October 2004 ACM SIGPLAN Notices , ACM SIGOPS Operating Systems Review , ACM 
SIGARCH Computer Architecture News , Proceedings of the 11th 
international conference on Architectural support for programming 
languages and operating systems ASPLOS-XI, volume 39 , 38 , 32 issue 11,5,5 

Publisher: ACM Press 
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Helper threading is a technology to accelerate a program by exploiting a processor's 
multithreading capability to run "assist" threads. Previous experiments on hyper-
threaded processors have demonstrated significant speedups by using helper threads to
prefetch hard-to-predict delinquent data accesses. In order to apply this technique to 
processors that do not have built-in hardware support for multithreading, we introduce 
virtual multithreading (VMT), a novel form of switch-on-event user-level ... 

Keywords: DB2 database, PAL, cache miss prefetching, helper thread, Itanium processor,
multithreading, switch-on-event 



UML-based multiprocessor SoC design framework
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This paper describes a complete design flow for multiprocessor systems-on-chips (SoCs) 
covering the design phases from system-level modeling to FPGA prototyping. The design 
of complex heterogeneous systems is enabled by raising the abstraction level and 
providing several system-level design automation tools. The system is modeled in a UML 
design environment following a new UML profile that specifies the practices for orthogonal
application and architecture modeling. The design flow tools are gov ... 
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The simulation system JAMES II is aimed at supporting a range of modeling formalisms 
and simulation engines. The partitioning of models is essential for distributed simulation.
A suitable partition depends on model, hardware, and simulation algorithm 
characteristics. Therefore, a partitioning layer has been created in JAMES II which allows 
to plug in partitioning algorithms on demand. Three different partitioning algorithms have 
been implemented. In addition to the well known Kernighan-Lin algor ... 

An object-based programming model for shared data
Gail E. Kaiser, Brent Hailpern 
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Volume 14 Issue 2 
Publisher: ACM Press 
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The classical object model supports private data within objects and clean interfaces 
between objects, and by definition does not permit sharing of data among arbitrary 
objects. This is a problem for real-world applications, such as advanced financial services
and integrated network management, where the same data logically belong to multiple 
objects and may be distributed over multiple nodes on the network. Rather than give up 
the advantages of encapsulated objects in modeling real-world en ... 

Keywords: coordination language, daemons, financial applications, object-based, real- 
time, sharing 
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Large-scale Non-Uniform Memory Access (NUMA) multiprocessors are gaining increased 
attention due to their potential for achieving high performance through the replication of 
relatively simple components. Because of the complexity of such systems, scheduling 
algorithms for parallel applications are crucial in realizing the performance potential of 
these systems. In particular, scheduling methods must consider the scale of the system, 
with the increased likelihood of creating bottlenecks, along wi ... 
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Lingkun Chu, Hong Tang, Tao Yang, Kai Shen 
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symposium on Principles and practice of parallel programming PPoPP '03, 

Volume 38 Issue 10 
Publisher: ACM Press 
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review 

Large-scale cluster-based Internet services often host partitioned datasets to provide 
incremental scalability. The aggregation of results produced from multiple partitions is a 
fundamental building block for the delivery of these services. This paper presents the 
design and implementation of a programming primitive — Data Aggregation Call (DAC) — 
to exploit partition parallelism for cluster-based Internet services. A DAC request specifies 
a local processing operator and a global reduction ope ... 

Keywords: cluster-based network services, fault tolerance, load-adaptive tree formation, 
response time, scalable data aggregation, throughput

Publisher: ACM Press

This paper presents secure program partitioning, a language-based technique for
protecting confidential data during computation in distributed systems containing mutually
untrusted hosts. Confidentiality and integrity policies can be expressed by annotating 
programs with security types that constrain information flow; these programs can then be 
partitioned automatically to run securely on heterogeneously trusted hosts. The resulting 
communicating subprograms collectively implement the original p ... 

Keywords: Confidentiality, declassification, distributed systems, downgrading, integrity, 
mutual distrust, secrecy, security policies, type systems

June 2005 ACM SIGPLAN Notices , Proceedings of the 2005 ACM SIGPLAN conference
on Programming language design and implementation PLDI '05, Volume 40
Issue 6

Publisher: ACM Press 
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• terms 

Speculative parallelization can provide significant sources of additional thread-level 
parallelism, especially for irregular applications that are hard to parallelize by 
conventional approaches. In this paper, we present the Mitosis compiler, which partitions 
applications into speculative threads, with special emphasis on applications for which
conventional parallelizing approaches fail. The management of inter-thread data
dependences is crucial for the performance of the system. The Mitosis fram ...

Keywords: automatic parallelization, pre-computation slices, speculative multithreading, 
thread-level parallelism

34 Issue 2

Full text available: pdf(823.43 KB) Additional Information: full citation , abstract , index terms

The key to high performance in Simultaneous Multithreaded (SMT) processors lies in 
optimizing the distribution of shared resources to active threads. Existing resource
distribution techniques optimize performance only indirectly. They infer potential
performance bottlenecks by observing indicators, like instruction occupancy or cache miss
counts, and take actions to try to alleviate them. While the corrective actions are
designed to improve performance, their actual performance impact is not kno ...

Minimum cost adaptive synchronization: experiments with the ParaSol system

Edward Mascarenhas, Felipe Knop, Reuben Pasquini, Vernon Rego 

October 1998 ACM Transactions on Modeling and Computer Simulation (TOMACS),
Volume 8 Issue 4
Publisher: ACM Press

Full text available: pdf(265.07 KB) Additional Information: full citation , abstract , references , citings , index terms
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We present a novel adaptive synchronization algorithm, called the minimum average cost 
(MAC) algorithm, in the context of the ParaSol parallel simulation system. ParaSol is a
multithreaded system for parallel simulation on shared- and distributed-memory 
environments, designed to support domain-specific Simulation Object Libraries. The 
proposed MAC algorithm is based on minimizing the cost of synchronization delay and 
rollback at a process, whenever its simulation driver must decide whether ... 

Keywords: ParaSol, adaptive synchronization, optimal delay, optimistic synchronization, 
parallel and distributed simulation, stochastic simulation, thread 



20 Feature-Based D ecom position of Inductive P roof s Applied to Real-Time Avion ics 
Software: An Exper ie nce Report 

Vu Ha, Murali Rangarajan, Darren Cofer, Harald Rues, Bruno Dutertre 

May 2004 Proceedings of the 26th International Conference on Software 

Engineering ICSE '04 
Publisher: IEEE Computer Society 

Full text available: pdf(253.19 KB) Additional Information: full citation, abstract, index terms

The hardware and software in modern aircraft control systems are good candidates for
verification using formal methods: they are complex, safety-critical, and challenge the
capabilities of test-based verification strategies. We have previously reported on our use of
model checking to verify the time partitioning property of the Deos real-time operating
system for embedded avionics. The size and complexity of this system have limited us to
analyzing only one configuration at a time. To overcome this lim ...
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21 A cost-driven compilation framework for speculative parallelization of sequential
programs

Zhao-Hui Du, Chu-Cheow Lim, Xiao-Feng Li, Chen Yang, Qingyu Zhao, Tin-Fook Ngai

June 2004 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 2004 conference 
on Programming language design and implementation PLDI '04, volume 39 

Issue 6 
Publisher: ACM Press 

Additional Information: full citation, abstract, references, citings, index
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Full text available: pdf(235.14 KB)



The emerging hardware support for thread-level speculation opens new opportunities to 
parallelize sequential programs beyond the traditional limits. By speculating that many 
data dependences are unlikely during runtime, consecutive iterations of a sequential loop 
can be executed speculatively in parallel. Runtime parallelism is obtained when the 
speculation is correct. To take full advantage of this new execution model, a program 
needs to be programmed or compiled in such a way that it exhibits ... 

Keywords: cost-driven compilation, loop transformation, speculative multithreading, 
speculative parallel threading, speculative parallelization, thread-level speculation



22 Connectivity-based garbage collection
Martin Hirzel, Amer Diwan, Matthew Hertz

October 2003 ACM SIGPLAN Notices , Proceedings of the 18th annual ACM SIGPLAN
conference on Object-oriented programming, systems, languages, and
applications OOPSLA '03, volume 38 issue 11
Publisher: ACM Press 

Additional Information: full citation, abstract, references, citings, index
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Full text available: pdf(521.65 KB)



We introduce a new family of connectivity-based garbage collectors (Cbgc) that are based 
on potential object-connectivity properties. The key feature of these collectors is that the 
placement of objects into partitions is determined by performing one of several forms of 
connectivity analyses on the program. This enables partial garbage collections, as in 
generational collectors, but without the need for any write barrier. The contributions of this
paper are 1) a novel family of garbage c ... 

Keywords: connectivity based garbage collection 



23 Fast detection of communication patterns in distributed executions
Thomas Kunz, Michiel F. H. Seuren 

November 1997 Proceedings of the 1997 conference of the Centre for Advanced
Studies on Collaborative research CASCON '97 

Publisher: IBM Press 
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Understanding distributed applications is a tedious and difficult task. Visualizations based 
on process-time diagrams are often used to obtain a better understanding of the 
execution of the application. The visualization tool we use is Poet, an event tracer 
developed at the University of Waterloo. However, these diagrams are often very complex 
and do not provide the user with the desired overview of the application. In our 
experience, such tools display repeated occurrences of non-trivial commun ... 

24 Verification of time partitioning in the DEOS scheduler kernel

John Penix, Willem Visser, Eric Engstrom, Aaron Larson, Nicholas Weininger 
June 2000 Proceedings of the 22nd international conference on Software 

engineering ICSE '00 
Publisher: ACM Press 

Full text available: pdf(111.58 KB) Additional Information: full citation, abstract, references, citings, index terms

This paper describes an experiment to use the Spin model checking system to support 
automated verification of time partitioning in the Honeywell DEOS real-time scheduling 
kernel. The goal of the experiment was to investigate whether model checking could be 
used to find a subtle implementation error that was originally discovered and fixed during 
the standard formal review process. To conduct the experiment, a core slice of the DEOS 
scheduling kernel was first translated without abstraction ... 

25 LoGPC: modeling network contention in message-passing programs
Csaba Andras Moritz, Matthew I. Frank

June 1998 ACM SIGMETRICS Performance Evaluation Review , Proceedings of the
1998 ACM SIGMETRICS joint international conference on Measurement 
and modeling of computer systems SIGMETRICS '98/PERFORMANCE '98, 
Volume 26 Issue 1 
Publisher: ACM Press 

Full text available: pdf(1.41 MB) Additional Information: full citation, abstract, references, citings, index terms

In many real applications, for example those with frequent and irregular communication 
patterns or those using large messages, network contention and contention for message 
processing resources can be a significant part of the total execution time. This paper 
presents a new cost model, called LoGPC, that extends the LogP [9] and LogGP [4]
models to account for the impact of network contention and network interface DMA 
behavior on the performance of message-passing programs. We validate LoGPC by a ... 

26 A model for dataflow based vector execution
W. Marcus Miller, Walid A. Najjar, A. P. Wim Bohm

July 1994 Proceedings of the 8th international conference on Supercomputing ICS '94

Publisher: ACM Press 

Full text available: pdf(1.28 MB) Additional Information: full citation, abstract, references, index terms

Although the dataflow model has been shown to allow the exploitation of parallelism at all 
levels, research of the past decade has revealed several fundamental problems: 
Synchronization at the instruction level, token matching, coloring and re-labeling 
operations have a negative impact on performance by significantly increasing the number 
of non-compute "overhead" cycles. Recently, many novel Hybrid von-Neumann Data
Driven machines have been proposed to alleviate some of these p ... 

27 Query evaluation techniques for large databases
Goetz Graefe 

June 1993 ACM Computing Surveys (CSUR), volume 25 issue 2 
Publisher: ACM Press 

Full text available: pdf(9.37 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Database management systems will continue to manage large data volumes. Thus, 
efficient algorithms for accessing and manipulating large sets and sequences will be 
required to provide acceptable performance. The advent of object-oriented and extensible 
database systems will not solve this problem. On the contrary, modern data models 
exacerbate the problem: In order to manipulate large sets of complex objects as
efficiently as today's database systems manipulate simple records, query-processi



Keywords: complex query evaluation plans, dynamic query evaluation plans, extensible 
database systems, iterators, object-oriented database systems, operator model of 
parallelization, parallel algorithms, relational database systems, set-matching algorithms, 
sort-hash duality 
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Sung-Eui Yoon, Brian Salomon, Russell Gayle, Dinesh Manocha 
October 2004 Proceedings of the conference on Visualization '04 VIS '04 
Publisher: IEEE Computer Society 
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We present a novel approach for interactive view-dependent rendering of massive 
models. Our algorithm combines view-dependent simplification, occlusion culling, and out- 
of-core rendering. We represent the model as a clustered hierarchy of progressive meshes 
(CHPM). We use the cluster hierarchy for coarse-grained selective refinement and 
progressive meshes for fine-grained local refinement. We present an out-of-core 
algorithm for computation of a CHPM that includes cluster decomposition, hierarch ...

Keywords: Interactive display, view-dependent rendering, occlusion culling, external- 
memory algorithm, levels-of-detail 
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David E. Culler, Anurag Sah, Klaus E. Schauser, Thorsten von Eicken, John Wawrzynek
April 1991 ACM SIGARCH Computer Architecture News , ACM SIGOPS Operating
Systems Review , ACM SIGPLAN Notices , Proceedings of the fourth
international conference on Architectural support for programming
languages and operating systems ASPLOS-IV, volume 19 , 25 , 26 issue 2 , Special Issue , 4
Publisher: ACM Press 

Full text available: pdf(1.41 MB) Additional Information: full citation, references, citings, index terms



30 Automatically partitioning packet processing applications for pipelined architectures
Jinquan Dai, Bo Huang, Long Li, Luddy Harrison

June 2005 ACM SIGPLAN Notices , Proceedings of the 2005 ACM SIGPLAN conference
on Programming language design and implementation PLDI '05, volume 40 

Issue 6 
Publisher: ACM Press 

Full text available: pdf(541.83 KB) Additional Information: full citation, abstract, references, citings, index
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Modern network processors employ parallel processing engines (PEs) to keep up with
explosive internet packet processing demands. Most network processors further allow 
processing engines to be organized in a pipelined fashion to enable higher processing 
throughput and flexibility. In this paper, we present a novel program transformation 
technique to exploit parallel and pipelined computing power of modern network 
processors. Our proposed method automatically partitions a sequential packet proces ... 

Keywords: live-set transmission, network processor, packet processing, parallel, 
pipelining transformation, program partition 

31 The costs and limits of availability for replicated services
Haifeng Yu, Amin Vahdat 

February 2006 ACM Transactions on Computer Systems (TOCS), volume 24 issue 1
Publisher: ACM Press 

Full text available: pdf(718.65 KB) Additional Information: full citation, abstract, references, index terms



As raw system performance continues to improve at exponential rates, the utility of many 
services is increasingly limited by availability rather than performance. A key approach to 
improving availability involves replicating the service across multiple, wide-area sites. 
However, replication introduces well-known trade-offs between service consistency and 
availability. Thus, this article explores the benefits of dynamically trading consistency for 
availability using a continuous consistency mo ... 

Keywords: Availability, continuous consistency, network services, replication, trade-off, 
upper bound 



32 Automating real-time multi-threaded application development
C. Riva, M. Krieger 

November 1995 Proceedings of the 1995 conference of the Centre for Advanced 
Studies on Collaborative research CASCON '95 

Publisher: IBM Press 

Full text available: pdf(89.57 KB) Additional Information: full citation, abstract, references, index terms

Over the past decade, the advances made in VLSI devices have driven the development
of cost-effective, symmetric multiprocessor architectures. In turn, operating systems have 
evolved to take advantage of the true concurrency offered by these architectures. 
However, because of the lack of adequate development tools, delivery of the intrinsic 
benefits of these platforms directly to applications has proven elusive. This has been 
particularly the case for the use of these systems in real-time applic ... 

33 A hybrid execution model for fine-grained languages on distributed memory
multicomputers

John Plevyak, Vijay Karamcheti, Xingbin Zhang, Andrew A. Chien
December 1995 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
(CDROM) - Volume 00 Supercomputing '95 

Publisher: ACM Press 

Full text available: html(59.68 KB) Additional Information: full citation, abstract, references, citings, index
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While fine-grained concurrent languages can naturally capture concurrency in many 
irregular and dynamic problems, their flexibility has generally resulted in poor execution 
efficiency. In such languages the computation consists of many small threads which are
created dynamically and synchronized implicitly. In order to minimize the overhead of
these operations, we propose a hybrid execution model which dynamically adapts to
runtime data layout, providing both sequential efficiency and low overhea ... 

34 Supporting dynamic data structures on distributed-memory machines
Anne Rogers, Martin C. Carlisle, John H. Reppy, Laurie J. Hendren

March 1995 ACM Transactions on Programming Languages and Systems (TOPLAS), 

Volume 17 Issue 2 
Publisher: ACM Press 

Full text available: pdf(2.05 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Compiling for distributed-memory machines has been a very active research area in
recent years. Much of this work has concentrated on programs that use arrays as their 
primary data structures. To date, little work has been done to address the problem of 
supporting programs that use pointer-based dynamic data structures. The techniques 
developed for supporting SPMD execution of array-based programs rely on the fact that 
arrays are statically defined and directly addressable. Recursive data s ... 

Keywords: dynamic data structures 



35 Automatic Thread Extraction with Decoupled Software Pipelining
Guilherme Ottoni, Ram Rangan, Adam Stoler, David I. August

November 2005 Proceedings of the 38th annual IEEE/ACM International Symposium 

on Microarchitecture MICRO 38 
Publisher: IEEE Computer Society 



Full text available: pdf(538.89 KB) Additional Information: full citation, abstract, citings, index terms

Until recently, a steadily rising clock rate and other uniprocessor microarchitectural 
improvements could be relied upon to consistently deliver increasing performance for a
wide range of applications. Current difficulties in maintaining this trend have led
microprocessor manufacturers to add value by incorporating multiple processors on a
chip. Unfortunately, since decades of compiler research have not succeeded in delivering 
automatic threading for prevalent code properties, this approach dem ... 

36 The state of the art in distributed query processing
Donald Kossmann 

December 2000 ACM Computing Surveys (CSUR), Volume 32 issue 4 
Publisher: ACM Press 

Full text available: pdf(455.39 KB) Additional Information: full citation, abstract, references, citings, index terms

Distributed data processing is becoming a reality. Businesses want to do it for many 
reasons, and they often must do it in order to stay competitive. While much of the 
infrastructure for distributed data processing is already there (e.g., modern network
technology), a number of issues make distributed data processing still a complex 
undertaking: (1) distributed systems can become very large, involving thousands of 
heterogeneous sites including PCs and mainframe server machines; (2) the stat ... 

Keywords: caching, client-server databases, database application systems, 
dissemination-based information systems, economic models for query processing, 
middleware, multitier architectures, query execution, query optimization, replication, 
wrappers 



37 Converting thread-level parallelism to instruction-level parallelism via simultaneous
multithreading
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August 1997 ACM Transactions on Computer Systems (TOCS), volume 15 issue 3
Publisher: ACM Press 

Full text available: pdf(526.39 KB) Additional Information: full citation, abstract, references, citings, index
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To achieve high performance, contemporary computer systems rely on two forms of 
parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide- 
issue super-scalar processors exploit ILP by executing multiple instructions from a single 
program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads 
in parallel on different processors. Unfortunately, both parallel processing styles statically 
partition processor resources, thus preventing t ... 

Keywords: cache interference, instruction-level parallelism, multiprocessors, 
multithreading, simultaneous multithreading, thread-level parallelism 
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June 2005 Proceedings of the 3rd international conference on Mobile systems, 
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Publisher: ACM Press 

Full text available: pdf(261.28 KB) Additional Information: full citation, abstract, references, citings

In this paper, we describe the design and implementation of a distributed operating 
system for ad hoc networks. Our system simplifies the programming of ad hoc networks 
and extends total system lifetime by making the entire network appear as a single virtual 
machine. It automatically and transparently partitions applications into components and 
dynamically finds them a placement on nodes within the network to reduce energy 
consumption and to increase system longevity. This paper describes our pr ...



Parallel execution of prolog programs: a survey
Gopal Gupta, Enrico Pontelli, Khayri A.M. Ali, Mats Carlsson, Manuel V. Hermenegildo
July 2001 ACM Transactions on Programming Languages and Systems (TOPLAS),

Volume 23 Issue 4 

Publisher: ACM Press 
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Since the early days of logic programming, researchers in the field realized the potential 
for exploitation of parallelism present in the execution of logic programs. Their high-level 
nature, the presence of nondeterminism, and their referential transparency, among other 
characteristics, make logic programs interesting candidates for obtaining speedups 
through parallel execution. At the same time, the fact that the typical applications of logic 
programming frequently involve irregular computatio ...

Keywords: Automatic parallelization, constraint programming, logic programming, 
parallelism, prolog 
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Jorg Haber, Demetri Terzopoulos 

August 2004 ACM SIGGRAPH 2004 Course Notes SIGGRAPH '04 

Publisher: ACM Press 

Full text available: pdf(18.15 MB) Additional Information: full citation, abstract

In this course we present an overview of the concepts and current techniques in facial 
modeling and animation. We introduce this research area by its history and applications. 
As a necessary prerequisite for facial modeling, data acquisition is discussed in detail. We
describe basic concepts of facial animation and present different approaches including 
parametric models, performance-, physics-, and learning-based methods. State-of-the-art 
techniques such as muscle-based facial animation, mass-s ... 
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A survey of processors with explicit multithreading
Theo Ungerer, Borut Robic, Jurij Silc 

March 2003 ACM Computing Surveys (CSUR), volume 35 issue 1
Publisher: ACM Press 
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Full text available: 



Hardware multithreading is becoming a generally applied technique in the next generation 
of microprocessors. Several multithreaded processors are announced by industry or 
already into production in the areas of high-performance microprocessors, media, and
network processors. A multithreaded processor is able to pursue two or more threads of
control in parallel within the processor pipeline. The contexts of two or more threads of
control are often stored in separate on-chip register sets. Unused i ...

Keywords: Blocked multithreading, interleaved multithreading, simultaneous 
multithreading 
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43 Waiting algorithms for synchronization in large-scale multiprocessors
Beng-Hong Lim, Anant Agarwal 
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Through analysis and experiments, this paper investigates two-phase waiting algorithms 
to minimize the cost of waiting for synchronization in large-scale multiprocessors. In a
two-phase algorithm, a thread first waits by polling a synchronization variable. If the cost
of polling reaches a limit Lpoll and further waiting is necessary, the thread is blocked,
incurring an additional fixed cost, B. The choice of Lpoll

Keywords: barriers, blocking, competitive analysis, locks, producer-consumer 
synchronization, spinning, waiting time 
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Design and evaluation of a conit-based continuous consistency model for replicated
services
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August 2002 ACM Transactions on Computer Systems (TOCS), volume 20 issue 3 
Publisher: ACM Press 
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The tradeoffs between consistency, performance, and availability are well understood.
Traditionally, however, designers of replicated systems have been forced to choose from 
either strong consistency guarantees or none at all. This paper explores the semantic 
space between traditional strong and optimistic consistency models for replicated 
services. We argue that an important class of applications can tolerate relaxed 
consistency, but benefit from bounding the maximum rate of inconsistent access ...

Keywords: Conit, consistency model, continuous consistency, network services, relaxed 
consistency, replication
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The growing complexity and the safety-critical requirements of the embedded software in
avionics systems present many challenges to current test-based verification technology.
The use of formal verification methods can increase design assurance by exploring a
larger range of system behaviors and fault conditions than can feasibly be covered by 
testing or simulation. However, one of the most challenging tasks faced in any formal 
verification activity is the construction of an adequate model fo ... 

46 Run-time adaptation in river
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February 2003 ACM Transactions on Computer Systems (TOCS), volume 21 issue 1 
Publisher: ACM Press 

Full text available: pdf(849.04 KB) Additional Information: full citation, abstract, references, index terms

We present the design, implementation, and evaluation of run-time adaptation within the 
River dataflow programming environment. The goal of the River system is to provide 
adaptive mechanisms that allow database query-processing applications to cope with 
performance variations that are common in cluster platforms. We describe the system and 
its basic mechanisms, and carefully evaluate those mechanisms and their effectiveness. 
In our analysis, we answer four previously unanswered and important que ... 

Keywords: Performance availability, clusters, parallel I/O, performance faults, robust 
performance, run-time adaptation 
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The TOSCA environment for hardware/software co-design of control dominated systems 
implemented on a single chip includes a novel approach to the system exploration phase 
for the evaluation of alternative architectures. The paper presents the metrics and the 
partitioning algorithm defined for the identification of the best hardware and software 
bindings and modularization, given the design constraints and goals. The system 
exploration phase is implemented as an iterative process directed by the u ... 
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Additional presentations from the 2 4th course are available on the citation
page 

Mashhuda Glencross, Alan G. Chalmers, Ming C. Lin, Miguel A. Otaduy, Diego Gutierrez 
July 2006 ACM SIGGRAPH 2006 Courses SIGGRAPH '06 
Publisher: ACM Press 
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The objective of this course is to provide an introduction to the issues that must be
considered when building high-fidelity 3D engaging shared virtual environments. The 
principles of human perception guide important development of algorithms and 
techniques in collaboration, graphical, auditory, and haptic rendering. We aim to show 
how human perception is exploited to achieve realism in high fidelity environments within 
the constraints of available finite computational resources. In this course w ... 

Keywords: collaborative environments, haptics, high-fidelity rendering, human-computer 
interaction, multi-user, networked applications, perception, virtual reality 



49 Analytical cache models with applications to cache partitioning
G. Edward Suh, Srinivas Devadas, Larry Rudolph

June 2001 Proceedings of the 15th international conference on Supercomputing ICS '01

Publisher: ACM Press 

Full text available: pdf(854.26 KB) Additional Information: full citation, abstract, references, citings, index terms

An accurate, tractable, analytic cache model for time-shared systems is presented, which 
estimates the overall cache miss-rate of a multiprocessing system with any cache size and 
time quanta. The input to the model consists of the isolated miss-rate curves for each 
process, the time quanta for each of the executing processes, and the total cache size. 
The output is the overall miss-rate. Trace-driven simulations demonstrate that the 
estimated miss-rate is very accurate. Since the model provid ...

50 Modeling and evaluation of hardware/software designs
Neal K. Tibrewala, JoAnn M. Paul, Donald E. Thomas 

April 2001 Proceedings of the ninth international symposium on Hardware/software 
codesign CODES '01 

Publisher: ACM Press 

Full text available: pdf(703.44 KB) Additional Information: full citation, abstract, references, citings, index terms

We introduce the foundation of a system modeling environment targeted at capturing the 
anticipated interactions of hardware and software behaviors — not just their co-execution. 
Key to our approach is the separation of external and internal design testbenches. We use 
a frequency interleaved scheduling foundation ideally suited to our approach because it 
allows unrestricted hardware and software modeling, a mix of untimed and timed 
software, and a layered approach using software s ... 

Keywords: computer system modeling and simulation, digital system design, 
hardware/software codesign 



51 Decentralizing execution of composite web services

Mangala Gowri Nanda, Satish Chandra, Vivek Sarkar
October 2004 ACM SIGPLAN Notices , Proceedings of the 19th annual ACM SIGPLAN 
conference on Object-oriented programming, systems, languages, and 
applications OOPSLA '04, volume 39 issue 10 
Publisher: ACM Press 

Full text available: pdf(429.43 KB) Additional Information: full citation, abstract, references, citings, index



Distributed enterprise applications today are increasingly being built from services 
available over the web. A unit of functionality in this framework is a web service, a
software application that exposes a set of "typed" connections that can be accessed over
the web using standard protocols. These units can then be composed into a
composite web service. BPEL (Business Process Execution Language) is a high-
level distributed programming language for creating composite web servi ... 

Keywords: business process, program dependence graph, work flow 



52 Models for performance perturbation analysis
Allen D. Malony, Daniel A. Reed 

December 1991 ACM SIGPLAN Notices , Proceedings of the 1991 ACM/ONR workshop
on Parallel and distributed debugging PADD '91, volume 26 issue 12
Publisher: ACM Press 

Full text available: pdf(1.16 MB) Additional Information: full citation, references, citings, index terms




53 System-level power optimization: techniques and tools
Luca Benini, Giovanni De Micheli

April 2000 ACM Transactions on Design Automation of Electronic Systems (TODAES),
Volume 5 Issue 2 
Publisher: ACM Press 

Full text available: pdf(385.22 KB) Additional Information: full citation, abstract, references, citings, index terms

This tutorial surveys design methods for energy-efficient system-level design. We
consider electronic systems consisting of a hardware platform and software layers. We
consider the three major constituents of hardware that consume energy, namely
computation, communication, and storage units, and we review methods of reducing their
energy consumption. We also study models for analyzing the energy cost of software, and
methods for energy-efficient software design and compilation. This survey ...

54 Modeling cost/performance of a parallel computer simulator
Babak Falsafi, David A. Wood

January 1997 ACM Transactions on Modeling and Computer Simulation (TOMACS),

Volume 7 Issue 1 
Publisher: ACM Press 

Full text available: pdf(585.68 KB) Additional Information: full citation, references, citings, index terms, review



Keywords: conservative discrete-event simulation, memory systems, performance 
analysis, shared-memory multiprocessors 



55 Formal Aspects and Distributed Systems: Modeling and simulation of steady state
and transient behaviors for emergent SoCs
JoAnn M. Paul, Arne J. Suppe, Donald E. Thomas

September 2001 Proceedings of the 14th International symposium on Systems 
synthesis ISSS '01 

Publisher: ACM Press 

Full text available: pdf(409.23 KB) Additional Information: full citation, abstract, references, citings, index terms

We introduce a formal basis for viewing computer systems as mixed steady state and 
non-steady state (transient) behaviors to motivate novel design strategies resulting from 
simultaneous consideration of function, scheduling and architecture. We relate three 
design styles: hierarchical decomposition, static mapping and directed platform that have 
traditionally been separate. By considering them together, we reason that once a steady 
state system is mapped to an architecture, the unused processing ... 



Keywords: computer system modeling and simulation, hardware/software codesign. 
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(TOPLAS), Volume 20 Issue 6 
Publisher: ACM Press 

Full text available: ^ pdf(434.44 KB) Additional Information: full citation , abstract, references , citings , index 
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Many programming languages support either task parallelism or data parallelism, but few
languages provide a uniform framework for writing applications that need both types of
parallelism. We present a programming language and system that integrates task and
data parallelism using shared objects. Shared objects may be stored on one processor or 
may be replicated. Objects may also be partitioned and distributed on several 
processors. Task parallelism is achieved by forking processes remotely a ...

Keywords: data parallelism, shared objects, task parallelism 



57 Real-time shading 

Marc Olano, Kurt Akeley, John C. Hart, Wolfgang Heidrich, Michael McCool, Jason L. Mitchell,
Randi Rost 

August 2004 ACM SIGGRAPH 2004 Course Notes SIGGRAPH '04 
Publisher: ACM Press 

Full text available: pdf(7.39 MB) Additional Information: full citation, abstract

Real-time procedural shading was once seen as a distant dream. When the first version of 
this course was offered four years ago, real-time shading was possible, but only with one-
of-a-kind hardware or by combining the effects of tens to hundreds of rendering passes.
Today, almost every new computer comes with graphics hardware capable of interactively 
executing shaders of thousands to tens of thousands of instructions. This course has been 
redesigned to address today's real-time shading capabili ... 



58 Tartan: evaluating spatial computation for whole program execution




Mahim Mishra, Timothy J. Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C. 
Goldstein, Mihai Budiu 



October 2006 ACM SIGOPS Operating Systems Review, ACM SIGARCH Computer
Architecture News, ACM SIGPLAN Notices, Proceedings of the 12th
international conference on Architectural support for programming
languages and operating systems ASPLOS-XII, Volume 40, 34, 41 Issue 5, 5, 11

Publisher: ACM Press 

Full text available: pdf(318.86 KB) Additional Information: full citation, abstract, references, index terms

Spatial Computing (SC) has been shown to be an energy-efficient model for implementing
program kernels. In this paper we explore the feasibility of using SC for more than small
kernels. To this end, we evaluate the performance and energy efficiency of entire
applications on Tartan, a general-purpose architecture which integrates a reconfigurable
fabric (RF) with a superscalar core. Our compiler automatically partitions and compiles an
application into an instruction stream for the core and a con ...

Keywords: asynchronous circuits, dataflow machine, defect tolerance, low power, 
reconfigurable hardware, spatial computation 
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June 2004 ACM SIGPLAN Notices , Proceedings of the 2004 ACM SIGPLAN/SIGBED 
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LCTES '04, Volume 39 Issue 7 
Publisher: ACM Press 




Full text available: 
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terms 



The next generation of tools for embedded systems design will represent a common arena
for several cooperating groups. These tools will permit system design at a high
abstraction level and enable automatic refinement through several abstraction levels to
obtain the final prototype. To facilitate this evolution, we propose a new .Net Framework
based system level modeling and simulation environment. This environment allows (1) 
cooperation — by enabling web-based design and multi-language features ... 

Keywords: .Net, C#, C++, CIL, ESys.Net, HDLs, Java, SystemC, SystemVerilog, VHDL,
Verilog, attribute programming, component-based programming, embedded systems, 
framework, hardware/software codesign, modeling, simulation, system on chip 



60 Scheduling threads for low space requirement and good locality
Girija J. Narlikar 

June 1999 Proceedings of the eleventh annual ACM symposium on Parallel
algorithms and architectures SPAA '99
Publisher: ACM Press

Full text available: pdf(1.69 MB) Additional Information: full citation, references, citings, index terms






61 Thread-based programming for the EM-4 hybrid dataflow machine

Mitsuhisa Sato, Yuetsu Kodama, Shuichi Sakai, Yoshinori Yamaguchi, Yasuhito Koumura
April 1992 ACM SIGARCH Computer Architecture News, Proceedings of the 19th
annual international symposium on Computer architecture ISCA '92, Volume 20 Issue 2
Publisher: ACM Press
Publisher: ACM Press 

Full text available: pdf(967.67 KB) Additional Information: full citation, abstract, references, citings, index terms

In this paper, we present a thread-based programming model for the EM-4 hybrid
dataflow machine, where parallelism and synchronization among threads of sequential
execution are described explicitly by the programmer. Although EM-4 was originally
designed as a dataflow machine, we demonstrate that it provides effective architectural
support for a variety of programming styles, including message passing and distributed
data sharing in imperative languages. Our approach allows the programmer t ...

62 Time weaver: a software-through-models framework for embedded real-time systems
Dionisio de Niz, Raj Rajkumar 

June 2003 ACM SIGPLAN Notices, Proceedings of the 2003 ACM SIGPLAN conference
on Language, compiler, and tool for embedded systems LCTES '03, Volume 38 Issue 7
Publisher: ACM Press

Full text available: pdf(467.76 KB) Additional Information: full citation, abstract, references, citings, index terms

Embedded real-time systems are deployed in a wide range of application domains 
including transportation systems, automated manufacturing, process control, defense, 
aerospace, and telecommunications. These systems must satisfy not only logical 
functional requirements but also para-functional properties such as timeliness, Quality of
Service (QoS) and reliability. The cross-cutting behaviors imposed by these para-
functional properties and dependencies on operational characteristics (e.g. ha ...

Keywords: couplers, embedded, real-time, semantic dimension, semantic separation, 
software-through-models 



63 The Amber system: parallel programming on a network of multiprocessors
J. Chase, F. Amador, E. Lazowska, H. Levy, R. Littlefield 

November 1989 ACM SIGOPS Operating Systems Review , Proceedings of the twelfth 
ACM symposium on Operating systems principles SOSP '89, volume 23 
Issue 5 

Publisher: ACM Press 

Full text available: pdf(1.53 MB) Additional Information: full citation, abstract, references, citings, index terms
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This paper describes a programming system called Amber that permits a single 
application program to use a homogeneous network of computers in a uniform way,
making the network appear to the application as an integrated multiprocessor. Amber is
specifically designed for high performance in the case where each node in the network is 
a shared-memory multiprocessor. Amber shows that support for loosely-coupled 
multiprocessing can be efficiently realized using an obje ... 

64 Generation and quantitative evaluation of dataflow clusters
Lucas Roh, Walid A. Najjar, A. P. Wim Bohm

July 1993 Proceedings of the conference on Functional programming languages and
computer architecture FPCA '93
Publisher: ACM Press 
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65 A Model for the Coanalysis of Hardware and Software Architectures 

Fred Rose, Todd Carpenter, Sanjaya Kumar, John Shackleton, Todd Steeves Honeywell 
March 1996 Proceedings of the 4th International Workshop on Hardware/Software
Co-Design CODES '96
Publisher: IEEE Computer Society 

Full text available: pdf, Publisher Site Additional Information: full citation, abstract

Successful multiprocessor system design for complex real-time embedded applications 
requires powerful and comprehensive, yet cost-effective, productive, and maintainable
modeling. The multi-disciplinary, VHDL-based modeling library developed by the 
Honeywell Technology Center places heavy emphasis on multiprocessing and distributed 
communications. These models focus on detailed hardware performance analysis along 
with multiple abstraction levels for software representation and evaluation. This ... 

Keywords: VHDL, performance modeling, hardware/software codesign, RASSP 



66 Untrusted hosts and confidentiality: secure program partitioning
Steve Zdancewic, Lantian Zheng, Nathaniel Nystrom, Andrew C. Myers 
October 2001 ACM SIGOPS Operating Systems Review, Proceedings of the eighteenth
ACM symposium on Operating systems principles SOSP '01, Volume 35 Issue 5
Publisher: ACM Press

Full text available: pdf(1.36 MB) Additional Information: full citation, abstract, references, citings, index terms

This paper presents secure program partitioning, a language-based technique for 
protecting confidential data during computation in distributed systems containing mutually 
untrusted hosts. Confidentiality and integrity policies can be expressed by annotating 
programs with security types that constrain information flow; these programs can then be 
partitioned automatically to run securely on heterogeneously trusted hosts. The resulting 
communicating subprograms collectively implement the original p ... 

67 Concepts and paradigms of object-oriented programming
Peter Wegner

August 1990 ACM SIGPLAN OOPS Messenger, Volume 1 Issue 1
Publisher: ACM Press

Full text available: pdf(5.52 MB) Additional Information: full citation, abstract, citings, index terms

We address the following questions for object-oriented programming: What is it? What are
its goals? What are its origins? What are its paradigms? What are its design alternatives?
What are its models of concurrency? What are its formal computational models? What
comes after object-oriented programming? Starting from software engineering goals, we
examine the origins and paradigms of object-oriented programming, explore its language 
design alternativ ... 



68 The performance of µ-kernel-based systems

Hermann Hartig, Michael Hohmuth, Jochen Liedtke, Sebastian Schonberg, Jean Wolter
October 1997 ACM SIGOPS Operating Systems Review, Proceedings of the sixteenth
ACM symposium on Operating systems principles SOSP '97, Volume 31 Issue
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Publisher: ACM Press 

69 Processor allocation policies for message-passing parallel computers
Cathy McCann, John Zahorjan

May 1994 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the
1994 ACM SIGMETRICS conference on Measurement and modeling of
computer systems SIGMETRICS '94, Volume 22 Issue 1
Publisher: ACM Press

Full text available: pdf(1.50 MB) Additional Information: full citation, abstract, references, citings, index terms
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When multiple jobs compete for processing resources on a parallel computer, the 
operating system kernel's processor allocation policy determines how many and which 
processors to allocate to each. In this paper we investigate the issues involved in 
constructing a processor allocation policy for large scale, message-passing parallel 
computers supporting a scientific workload. We make four specific contributions: We 
define the concept of efficiency preservat ... 

70 Research papers: test & analysis 1: Automated, contract-based user testing of
commercial-off-the-shelf components

Lionel C. Briand, Yvan Labiche, Michal M. Sowka

May 2006 Proceeding of the 28th international conference on Software engineering
ICSE '06
Publisher: ACM Press

Full text available: pdf(173.17 KB) Additional Information: full citation, abstract, references, index terms

Commercial-off-the-shelf (COTS) components provide a means to construct software 
(component-based) systems in reduced time and cost. In a COTS component software 
market there exist component vendors (original developers of the component) and 
component users (developers of the component-based systems). The former provide the 
component to the user without source code or design documentation, and as a result it is 
difficult for the latter to adequately test the component when deployed in their syst ... 

Keywords: COTS, UML, adequacy criteria, component 



71 Addressing dynamic issues of program model checking
Flavio Lerda, Willem Visser

May 2001 Proceedings of the 8th international SPIN workshop on Model checking of
software SPIN '01

Publisher: Springer-Verlag New York, Inc.

Full text available: pdf(213.42 KB) Additional Information: full citation, abstract, references, citings

Model checking real programs has recently become an active research area. Programs
however exhibit two characteristics that make model checking difficult: the complexity of
their state and the dynamic nature of many programs. Here we address both these issues
within the context of the Java PathFinder (JPF) model checker. Firstly, we will show how
the state of a Java program can be encoded efficiently and how this encoding can be
exploited to improve model checking. Next we show how to use sym ...

72 Architecture: Fast and fair: data-stream quality of service
Thomas Y. Yeh, Glenn Reinman

September 2005 Proceedings of the 2005 international conference on Compilers,
architectures and synthesis for embedded systems CASES '05
Publisher: ACM Press

Full text available: pdf(318.10 KB) Additional Information: full citation, abstract, references, index terms

Chip multiprocessors have the potential to exploit thread level parallelism, particularly in 
the context of embedded server farms where the available number of threads can be 
quite high. Unfortunately, both per-core and overall throughput are significantly impacted
by the organization of the lowest level on-chip cache. On-chip caches for CMPs must be
able to handle the increased demand and contention of multiple cores. To complicate the 
problem, cache demand changes dynamically with phases chang ... 

Keywords: CMP, NUCA, PDAS, QoS, adaptive, bandwidth, cache, chip multiprocessor,
cluster, data-stream, distributed, embedded, memory wall, migration, non-uniform 
access, partition, per thread degradation, phase 



73 Research sessions: data integration: Adapting to source properties in processing
data integration queries

Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld

June 2004 Proceedings of the 2004 ACM SIGMOD international conference on
Management of data SIGMOD '04
Publisher: ACM Press

Full text available: pdf(197.27 KB) Additional Information: full citation, abstract, references, citings

An effective query optimizer finds a query plan that exploits the characteristics of the 
source data. In data integration, little is known in advance about sources' properties, 
which necessitates the use of adaptive query processing techniques to adjust query 
processing on-the-fly. Prior work in adaptive query processing has focused on 
compensating for delays and adjusting for mis-estimated cardinality or selectivity values. 
In this paper, we present a generalized architecture for adaptiv ...

74 Balancing register allocation across threads for a multithreaded network processor
Xiaotong Zhuang, Santosh Pande

June 2004 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 2004 conference
on Programming language design and implementation PLDI '04, Volume 39 Issue 6

Publisher: ACM Press

Full text available: pdf(429.85 KB) Additional Information: full citation, abstract, references, citings, index terms

Modern network processors employ multi-threading to allow concurrency amongst 
multiple packet processing tasks. We studied the properties of applications running on the 
network processors and observed that their imbalanced register requirements across 
different threads at different program points could lead to poor performance. Many times 
application needs demand some threads to be more performance critical than others and 
thus by controlling the register allocation across threads one could impa ... 

Keywords: multithreaded processor, network processor, register allocation 
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Publisher: ACM Press 

Full text available: pdf(840.43 KB) Additional Information: full citation, abstract, citings, index terms

Recently, lightweight thread libraries have become a common entity to support concurrent
programming on shared memory multiprocessors. However, the disparity between
primitives offered by operating systems creates a challenge for those who wish to create
portable lightweight thread packages. What should be the interface between the machine-
independent and machine-dependent parts of the thread library? We have implemented a
portable lightweight thread library on top of Unix on a KSR-1 supercomput ... 

76 Partitioning based operating system: a formal model
Lei Luo, Ming-Yuan Zhu

July 2003 ACM SIGOPS Operating Systems Review, Volume 37 Issue 3
Publisher: ACM Press

Full text available: pdf(736.77 KB) Additional Information: full citation, abstract, references



Automated aircraft control has traditionally been divided into distinct functions that are
implemented separately (e.g., autopilot, auto-throttle, flight management); each function
has its own fault-tolerant computer system, and dependencies among different functions
are generally limited to the exchange of sensor and control data. A by-product of this 
federated architecture is that faults are strongly contained within the computer system of 
the function where they occur and cannot readily propa ... 

77 Dynamically managing the communication-parallelism trade-off in future clustered
processors

Rajeev Balasubramonian, Sandhya Dwarkadas, David H. Albonesi

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture ISCA '03, Volume 31 Issue 2
Publisher: ACM Press

Full text available: pdf(206.34 KB) Additional Information: full citation, abstract, references, citings

Clustered microarchitectures are an attractive alternative to large monolithic superscalar 
designs due to their potential for higher clock rates in the face of increasingly wire-delay- 
constrained process technologies. As increasing transistor counts allow an increase in the 
number of clusters, thereby allowing more aggressive use of instruction-level parallelism
(ILP), the inter-cluster communication increases as data values get spread across a wider 
area. As a result of the emergence of this tr ... 

78 Impact of sharing-based thread placement on multithreaded architectures
R. Thekkath, S. J. Eggers

April 1994 ACM SIGARCH Computer Architecture News, Proceedings of the 21st
annual international symposium on Computer architecture ISCA '94, Volume 22 Issue 2

Publisher: IEEE Computer Society Press, ACM Press

Full text available: pdf(1.33 MB) Additional Information: full citation, abstract, references, citings, index terms

Multithreaded architectures context switch between instruction streams to hide memory
access latency. Although this improves processor utilization, it can increase cache
interference and degrade overall performance. One technique to reduce the interconnect
traffic is to co-locate threads that share data on the same processor. Multiple threads
sharing in the cache should reduce compulsory and invalidation misses, thereby
improving execution time. To test this hypothesis, we compared a variety of t ...

79 Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded
Processors

Dongkeun Kim, Steve Shih-wei Liao, Perry H. Wang, Juan del Cuvillo, Xinmin Tian, Xiang
Zou, Hong Wang, Donald Yeung, Milind Girkar, John P. Shen

March 2004 Proceedings of the international symposium on Code generation and
optimization: feedback-directed and runtime optimization CGO '04

Publisher: IEEE Computer Society

Full text available: pdf(264.47 KB) Additional Information: full citation, abstract, citings, index terms

Pre-execution techniques have received much attention as an effective way of prefetching
cache blocks to tolerate the ever-increasing memory latency. A number of pre-execution
techniques based on hardware, compiler, or both have been proposed and studied
extensively by researchers. They report promising results on simulators that model a
Simultaneous Multithreading (SMT) processor. In this paper, we apply the helper threading
idea on a real multithreaded machine, i.e., Intel Pentium 4 processor with Hyp ...

80 Data and memory optimization techniques for embedded systems
P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A.
Vandercappelle, P. G. Kjeldsberg

April 2001 ACM Transactions on Design Automation of Electronic Systems (TODAES),
Volume 6 Issue 2
Publisher: ACM Press

Full text available: pdf(339.91 KB) Additional Information: full citation, abstract, references, citings, index terms
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We present a survey of the state-of-the-art techniques used in performing data and 
memory-related optimizations in embedded systems. The optimizations are targeted
directly or indirectly at the memory subsystem, and impact one or more out of three
important cost metrics: area, performance, and power dissipation of the resulting
implementation. We first examine architecture-independent optimizations in the form of
code transformations. We next cover a broad spectrum of optimizati ...

Keywords: DRAM, SRAM, address generation, allocation, architecture exploration, code 
transformation, data cache, data optimization, high-level synthesis, memory architecture 
customization, memory power dissipation, register file, size estimation, survey 

81 Evaluating network processing efficiency with processor partitioning and
asynchronous I/O

Tim Brecht, G. (John) Janakiraman, Brian Lynn, Vikram Saletore, Yoshio Turner
April 2006 ACM SIGOPS Operating Systems Review, Proceedings of the 2006
EuroSys conference EuroSys '06, Volume 40 Issue 4
Publisher: ACM Press

Full text available: pdf(521.28 KB) Additional Information: full citation, abstract, references, index terms

Applications requiring high-speed TCP/IP processing can easily saturate a modern server. 
We and others have previously suggested alleviating this problem in multiprocessor
environments by dedicating a subset of the processors to perform network packet
processing. The remaining processors perform only application computation, thus
eliminating contention between these functions for processor resources. Applications
interact with packet processing engines (PPEs) using an asynchronous I/O (AIO) prog ...



Keywords: TCP/IP, asynchronous I/O, network processing 
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MPI is a message-passing standard widely used for developing high-performance parallel 
applications. Because of the restriction in the MPI computation model, conventional 
implementations on shared memory machines map each MPI node to an OS process, 
which suffers serious performance degradation in the presence of multiprogramming, 
especially when a space/time sharing policy is employed in OS job scheduling. In this 
paper, we study compile-time and run-time support for MPI by using threads and dem ...
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This article explores memory sharing and protection support in Opal, a single-address- 
space operating system designed for wide-address (64-bit) architectures. Opal threads 
execute within protection domains in a single shared virtual address space. Sharing is 
simplified, because addresses are context independent. There is no loss of protection,
because addressability and access are independent; the right to access a segment is
determined by the protection domain in which a thread executes. T ...

Keywords: 64-bit architectures, capability-based systems, microkernel operating 
systems, object-oriented database systems, persistent storage, protection, single- 
address-space operating systems, wide-address architectures 
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In this paper, we present a functional partitioning method for low power real-time 
distributed embedded systems whose constituent nodes are systems-on-a-chip (SOCs). 
The system-level specification is assumed to be given as a set of task graphs. The goal is
to partition the task graphs so that each partitioned segment is implemented as an SOC
and the embedded system is realized as a distributed system of SOCs. Unlike most
previous synthesis and partitioning tools, this technique merges partitioni ...

Keywords: functional partitioning, SOC synthesis, genetic algorithm 
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Many high-level parallel programming languages allow for fine-grained parallelism. As in 
the popular work-time framework for parallel algorithm design, programs written in such
languages can express the full parallelism in the program without specifying the mapping
of program tasks to processors. A common concern in executing such programs is to
schedule tasks to processors dynamically so as to minimize not only the execution time, 
but also the amount of space (memory) needed. Without caref ... 
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It is difficult to map the execution model of multithreading languages (languages which 
support fine-grain dynamic thread creation) onto the single stack execution model of C. 
Consequently, previous work on efficient multithreading uses elaborate frame formats and 
allocation strategy, with compilers customized for them. This paper presents an 
alternative cost-effective implementation strategy for multithreading languages which can
maximally exploit current sequential C compilers. We identify as... 
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Recent advances in static program analysis have made it possible to detect errors in 
applications that have been thoroughly tested and are in wide-spread use. The ability to 
find errors that have eluded traditional validation methods is due to the development and
combination of sophisticated algorithmic techniques that are embedded in the 
implementations of analysis tools. Evaluating new analysis techniques is typically 
performed by running an analysis tool on a collection of subject programs, p ... 
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We present a new technique for partitioning processes in distributed embedded systems. 
Our heuristic algorithm minimizes both context switch and communication overhead 
under real-time deadline and process size constraints; it also tries to allocate functions to 
processors which are well-suited to that function. The algorithm analyzes the sensitivity of 
the latency of the task graph to changes in vertices hierarchical clustering, splitting and 
border adjusting. This algorithm can be used for init ... 

Using Java in embedded systems is plagued by problems of limited runtime performance
and unpredictable runtime behavior. The Java Multi-Threaded Processor (JMTP) provides
solutions to these problems. The JMTP architecture is a single chip containing an off-the-
shelf general purpose processor core coupled with an array of Java Thread Processors
(JTPs). Performance can be improved using this architecture by exploiting coarse-grained
parallelism in the application. These performance im ... 
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Various control algorithms are used in autonomic control to maintain Quality of Service 
(QoS) and Service Level Agreements (SLAs). Controllers are all based to some extent on 
models of the relationship between resources, QoS measures, and the workload imposed 
by the environment. This work discusses the range of algorithms with an emphasis on 
richer and more powerful models to describe non-linear performance relationships, and 
strong interactions among the system resources. A hierarchical framewo ...
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The overhead of context switching limits efficient scheduling of multiple concurrent 
threads on a uniprocessor when real-time requirements exist. A software-implemented 
protocol controller may be crippled by this problem. The available idle time may be too 
short to recover through context switching, so only the primary thread can execute during 
message activity, slowing the secondary threads and potentially missing deadlines. 
Asynchronous software thread integration (ASTI) uses coroutine calls a ... 

Keywords: Asynchronous software thread integration, J1850, fine-grain concurrency, 
hardware to software migration, software-implemented communication protocol 
controllers 

The continuation of the remarkable exponential increases in processing power over the
recent past faces imminent challenges due in part to the physics of deep-submicron CMOS
devices and the costs of both chip masks and future fabrication plants. A promising
solution to these problems is offered by an alternative to CMOS-based computing,
chemically assembled electronic nanotechnology (CAEN).

In this paper we outline how CAEN-based computing can become a reality. We briefly
describe rec ... 

Dataraces in multithreaded programs often indicate severe bugs and can cause
unexpected behaviors when different thread interleavings are executed. Because
dataraces are a cause for concern, many works have dealt with the problem of detecting 
them. Works based on dynamic techniques either report errors only for dataraces that 
occur in the current interleaving, which limits their usefulness, or produce many spurious 
dataraces. Works based on model checking search exhaustively for dataraces and th ... 
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A simultaneous multithreading (SMT) processor can issue instructions from several 
threads every cycle, allowing it to effectively hide various instruction latencies; this effect 
increases with the number of simultaneous contexts supported. However, each added 
context on an SMT processor incurs a cost in complexity, which may lead to an increase in 
pipeline length or a decrease in the maximum clock rate. This paper presents new designs 
for multithreaded processors which combine a conservative SMT ... 
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Requirements interaction management (RIM) is the set of activities directed toward the 
discovery, management, and disposition of critical relationships among sets of 
requirements, which has become a critical area of requirements engineering. This survey 
looks at the evolution of supporting concepts and their related literature, presents an 
issues-based framework for reviewing processes and products, and applies the framework 
in a review of RIM state-of-the-art. Finally, it presents seven researc ... 
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Large multimedia document archives may hold a major fraction of their data in tertiary 
storage libraries for cost reasons. This paper develops an integrated approach to the
vertical data migration between the tertiary, secondary, and primary storage in that it 
reconciles speculative prefetching, to mask the high latency of the tertiary storage, with 
the replacement policy of the document caches at the secondary and primary storage 
level, and also considers the interaction of these policies with ... 

Keywords: Caching, Markov chains, Performance, Prefetching, Scheduling, Stochastic
modeling, Tertiary storage 
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